CN112256880A - Text recognition method and device, storage medium and electronic equipment - Google Patents

Publication number
CN112256880A
Authority
CN
China
Prior art keywords
text
cluster
target
vector
text cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011255568.XA
Other languages
Chinese (zh)
Inventor
何方
刘卓
胡少锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202011255568.XA
Publication of CN112256880A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text recognition method and device, a storage medium and electronic equipment. The method comprises: acquiring a target word list corresponding to a target text to be recognized; generating a target text vector corresponding to the target text by using the target word list; searching a stored text cluster set for a target text cluster matching the target text vector, wherein the text cluster set is updated periodically and comprises object text clusters each configured with a text type label, the text type label indicating whether the object text cluster is a malicious cluster or a non-malicious cluster; and determining the target text to be abnormal text when the target text cluster is found in the text cluster set and its text type label indicates a malicious cluster. The invention solves the technical problem of low text recognition efficiency caused by offline processing, under which junk texts cannot be found in time.

Description

Text recognition method and device, storage medium and electronic equipment
Technical Field
The invention relates to the field of internet, in particular to a text recognition method and device, a storage medium and electronic equipment.
Background
In many application scenarios, an interaction process is often required to be completed by means of text information, for example, the text information here may be an out-link forwarding text, a shared text in a community shared space, a comment text, a call text used when an account relationship is bound, and the like.
However, some illegal subjects often push spam text on a large scale for malicious purposes. For example, large amounts of fake red-packet text that induces forwarding appear in the forwarded text of out-link forwarding scenes, large amounts of gambling-related text appear in the conversation text of group conversation scenes, fifteen types of related malicious text appear in shared text, large amounts of traffic-diversion text appear in comment text, and large amounts of harassing content appear in call text. Such junk text has two characteristics: the contents are highly similar, and the volume is enormous.
For such junk text, a common recognition method at present is to calculate the similarity between the text to be recognized and known junk text using a traditional hash algorithm, and to directly block the sender of the text to be recognized when the two are determined to be similar. However, the text similarity calculated offline by a traditional hash algorithm has poor precision, and because the processing is offline, junk texts cannot be found in time, which results in low text recognition efficiency.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a text recognition method and device, a storage medium and electronic equipment, and aims to at least solve the technical problem of low text recognition efficiency caused by the fact that junk texts cannot be found in time due to offline processing.
According to an aspect of an embodiment of the present invention, there is provided a text recognition method including: acquiring a target word list corresponding to a target text to be recognized; generating a target text vector corresponding to the target text by using the target word list; searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated periodically, and the text cluster set comprises an object text cluster configured with a text type label, and the text type label is used for indicating that the object text cluster is a malicious cluster or a non-malicious cluster; and determining the target text as an abnormal text under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster.
According to another aspect of the embodiments of the present invention, there is also provided a text recognition apparatus, including: the acquisition unit is used for acquiring a target word list corresponding to a target text to be recognized; the generating unit is used for generating a target text vector corresponding to the target text by using the target word list; a searching unit, configured to search a target text cluster matched with the target text vector from a stored text cluster set, where the text cluster set is to be updated periodically, and the text cluster set includes an object text cluster configured with a text type tag, where the text type tag is used to indicate that the object text cluster is a malicious cluster or a non-malicious cluster; and the identification unit is used for determining the target text as an abnormal text under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster.
According to a further aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above text recognition method when running.
According to still another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the text recognition method through the computer program.
In the embodiment of the invention, after a target word list corresponding to a target text to be recognized is obtained, a target text vector corresponding to the target text is generated by using the target word list. And then searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated regularly, the text cluster set comprises an object text cluster configured with a text type label, and the text type label is used for indicating whether the object text cluster is a malicious cluster or not. And if the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates a malicious cluster, identifying and determining the target text as an abnormal text. That is to say, after a target text vector corresponding to a target text to be recognized is obtained, whether a target text cluster corresponding to the target text vector of the target text exists or not is searched in a text cluster set which is updated regularly and is configured with a text type label, so that the text type of the target text is recognized by directly using the text cluster of the determined text type, and off-line recognition is not required to be performed on each text separately, the purpose of improving text recognition efficiency is achieved, and the problem of low text recognition efficiency caused by the fact that junk texts cannot be found in time due to off-line processing in the related art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a diagram of a hardware environment for an alternative text recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hardware environment for an alternative text recognition method according to an embodiment of the present invention;
FIG. 3 is a flow diagram of an alternative text recognition method according to an embodiment of the present invention;
FIG. 4 is a diagram of recognition logic architecture for an alternative text recognition method according to an embodiment of the present invention;
FIG. 5 is a flow diagram of an alternative text recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an alternative text recognition method according to an embodiment of the present invention;
FIG. 7 is a flow diagram of yet another alternative text recognition method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an alternative text recognition method according to an embodiment of the present invention;
FIG. 9 is a flow diagram of yet another alternative text recognition method according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of yet another alternative text recognition method according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an alternative text recognition apparatus according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an alternative electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the text recognition method provided by the embodiment of the present application will refer to the following technical terms:
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence, and it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Text vector: the semantic information of a text is extracted and expressed in vector form, and the vector is used in place of the text during algorithm processing.
Real-time text clustering: similar texts in a real-time text data stream are aggregated to form cluster structures.
Bidirectional Encoder Representations from Transformers (BERT, written Bert below): a natural language processing model based on the Transformer architecture.
Hierarchical Navigable Small World graphs (HNSW for short): a data structure for approximate nearest neighbor similarity search.
Key-Value (kv for short) distributed storage system: a storage system with fast queries, large storage capacity and strong support for concurrency; it is suited to queries by primary key but cannot perform complex conditional queries.
According to an aspect of the embodiments of the present invention, there is provided a text recognition method. As an optional implementation, the text recognition method may be applied to, but is not limited to, a text recognition system in a hardware environment as shown in fig. 1, where the text recognition system may include, but is not limited to, a terminal device 102, a network 104, and a server 106. Here, the terminal device 102 includes a human-computer interaction screen 1022, a processor 1024, and a memory 1026. The human-computer interaction screen 1022 is used to acquire the target text to be recognized through a human-computer interaction interface. The processor 1024 is configured to encode and pre-process the target text to be recognized, and the memory 1026 is configured to store the target text to be recognized.
In addition, the server 106 includes a database 1062 and a processing engine 1064, where the database 1062 is used for storing the text cluster set. The processing engine 1064 is configured to process the target text received from the terminal device 102, and search a target text cluster corresponding to a target text vector of the target text in the text cluster set.
As an alternative embodiment, assume that the text recognition process provided in this embodiment is performed on comment text in a comment scene. The specific process may be as follows. In steps S102-S104, a target text to be recognized is obtained in the terminal device 102 and sent to the server 106 through the network 104. The server 106 then performs the following operations:
S106, performing text preprocessing on the target text to obtain a target word list;
S108, generating a target text vector corresponding to the target text by using the target word list;
S110, searching, through the processing engine 1064, a text cluster set stored in the database for a target text cluster matching the target text vector;
S112, under the condition that the target text cluster is found and the text type label corresponding to the target text cluster indicates a malicious cluster, determining the target text as an abnormal text.
Optionally, as in step S114, the recognition result of the target text may also be returned to the terminal device 102 through the network. The interaction process shown in fig. 1 is only an example; the recognition result may also be saved directly in the server so that it can be referred to and used in the next recognition process, which is not limited herein.
As an alternative implementation, assume that the text recognition process provided in this embodiment is again performed on comment text in a comment scene, using the devices in the text recognition system shown in fig. 1 described above. The specific process may be as follows. In steps S202-S206, a target text to be recognized is obtained in the terminal device 102, text preprocessing is performed on the target text in the terminal device 102 to obtain a target word list, and a target text vector corresponding to the target text is generated by using the target word list. The target text vector is then sent to the server 106 via the network 104, as in step S208. The server 106 then performs the following operations:
S210, searching, through the processing engine 1064, a text cluster set stored in the database for a target text cluster matching the target text vector;
S212, under the condition that the target text cluster is found and the text type label corresponding to the target text cluster indicates a malicious cluster, determining the target text as an abnormal text.
Optionally, as in step S214, the recognition result of the target text may also be returned to the terminal device 102 through the network. The interaction process shown in fig. 2 is only an example; the recognition result may also be saved directly in the server so that it can be referred to and used in the next recognition process, which is not limited herein.
That is to say, in this embodiment, the process of converting the target text to be recognized into the target text vector may be completed in the terminal device 102 or may be completed in the server 106, which is not limited herein.
It should be noted that, in this embodiment, after a target word list corresponding to a target text to be recognized is obtained, a target text vector corresponding to the target text is generated by using the target word list. A target text cluster matching the target text vector is then searched for in a stored text cluster set, wherein the text cluster set is updated periodically and comprises object text clusters configured with text type labels, each text type label indicating whether the object text cluster is a malicious cluster. If the target text cluster is found in the text cluster set and its text type label indicates a malicious cluster, the target text is determined to be abnormal text. That is to say, whether a target text cluster corresponding to the target text exists is looked up in a periodically updated text cluster set configured with text type labels, so that the text type of the target text is identified directly from a text cluster whose type is already determined, and offline identification need not be performed on each text separately. This improves text recognition efficiency and overcomes the problem in the related art that offline processing prevents junk texts from being found in time. In addition, identifying malicious text quickly and efficiently helps clean up network platforms that display or share user-generated content, avoids the interference that large amounts of illegal text cause to users' browsing or viewing, provides a healthy network environment for users, and ensures their browsing safety.
Optionally, in this embodiment, the terminal device may be a terminal device configured with a target client, and may include, but is not limited to, at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), notebook computers, tablet computers, palm computers, MID (Mobile Internet Devices), PAD, desktop computers, smart televisions, etc. The target client may be a community shared space application client, an instant messaging session client, a video playback application client, a content distribution application client, and the like. The network may include, but is not limited to, a wired network and a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, WIFI, and other networks that enable wireless communication. The server may be a single server, a server cluster composed of a plurality of servers, or a cloud server. The above is merely an example, and this is not limited in this embodiment.
Optionally, as an optional implementation manner, as shown in fig. 3, the text recognition method includes:
S302, acquiring a target word list corresponding to a target text to be recognized;
S304, generating a target text vector corresponding to the target text by using the target word list;
S306, searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated regularly and comprises an object text cluster configured with a text type label, and the text type label is used for indicating that the object text cluster is a malicious cluster or a non-malicious cluster;
S308, under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster, determining the target text as an abnormal text.
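The four steps above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the names `TextCluster`, `recognize`, and the three `toy_*` helpers are hypothetical, and the toy tokenizer, vectorizer and cluster lookup are deliberately trivial stand-ins for text preprocessing, the text vector model and the stored text cluster set. Only the control flow of S302 to S308 is taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class TextCluster:
    cluster_id: str
    label: Optional[str]  # "malicious", "non-malicious", or None if unlabeled


def recognize(
    text: str,
    tokenize: Callable[[str], List[str]],
    vectorize: Callable[[List[str]], Sequence[float]],
    find_cluster: Callable[[Sequence[float]], Optional[TextCluster]],
) -> bool:
    """Return True when the text is determined to be abnormal text."""
    words = tokenize(text)          # S302: target word list
    vector = vectorize(words)       # S304: target text vector
    cluster = find_cluster(vector)  # S306: look up the stored text cluster set
    # S308: abnormal only if a cluster was found AND it is labeled malicious
    return cluster is not None and cluster.label == "malicious"


# Toy stand-ins so the sketch runs; none of these come from the patent.
def toy_tokenize(text: str) -> List[str]:
    return text.lower().split()


def toy_vectorize(words: List[str]) -> Sequence[float]:
    # Two crude features: 1.0 if "free" occurs, 1.0 if "prize" occurs.
    return [float("free" in words), float("prize" in words)]


def toy_find_cluster(vector: Sequence[float]) -> Optional[TextCluster]:
    # Pretend the stored set holds a single malicious cluster at (1, 1).
    if list(vector) == [1.0, 1.0]:
        return TextCluster("c1", "malicious")
    return None
```

Passing the three stages in as functions mirrors the point made earlier that vectorization may run either on the terminal device or on the server; a text that hits no cluster, or hits a non-malicious cluster, is simply not flagged at this step.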
Optionally, in this embodiment, the text recognition method may be applied to, but is not limited to, application scenarios where content is presented or promoted based on User Generated Content (UGC), where UGC refers to content that users create themselves and present or provide to other users through an internet platform, for example friend social networks, video/photo/knowledge sharing networks, communities/forums, and the like. Optionally, the text information in the application scenarios to which the text recognition method provided in this embodiment is applied may include, but is not limited to, at least one of the following: forwarding text information in an out-link forwarding scene, shared text information in a community shared space, comment text information on displayed content, and call text information used when binding an account relationship (such as registration, adding a friend account, and the like). The recognition process mainly identifies junk texts that users publish through the network platform and that occur with high frequency and repeated content across different application scenarios, such as text sent in the form of a 'drift bottle', which may contain repeated content or even content involving illegal sensitive information (such as pornography, gambling and drug-related information). Such junk texts are malicious texts pushed by malicious subjects to achieve a certain purpose. They cause great information interference to the many users who normally browse web interfaces or follow content such as videos, so that a safe network browsing environment cannot be guaranteed for those users.
Thus, in order to overcome the above problems, malicious text in online traffic needs to be identified and cleaned up in a timely manner. However, in the related art, a common identification method is to perform offline identification on texts in a service flow using a traditional hash algorithm, so that not only is the identification accuracy poor, but the real-time performance is also poor, and malicious texts appearing in online data cannot be found in time. In the embodiment of the application, after the target text vector corresponding to the target text to be recognized is obtained, whether a target text cluster corresponding to the target text vector exists is looked up in a periodically updated text cluster set configured with text type labels, so that the text type of the target text is identified directly from a text cluster whose type is already determined, and each text need not be identified separately. This improves text recognition efficiency and overcomes the problem in the related art that offline processing prevents junk texts from being found in time. In addition, identifying malicious text quickly and efficiently helps clean up network platforms that display or share user-generated content, avoids the interference that large amounts of illegal text cause to users' browsing or viewing, provides a healthy network environment for users, and ensures their browsing safety.
Optionally, in this embodiment, the text recognition method may be, but is not limited to, applied to a recognition logic architecture as shown in fig. 4, where the recognition logic architecture may include, but is not limited to: a text vector generation module 402, a vector search module 404, a real-time clustering module 406, and a label assignment module 408. The text vector generation module 402 is configured to convert a target word list of a target text into a target text vector; a text cluster set is stored in the vector search module 404, which is used for searching a target text cluster matched with the target text vector in the stored text cluster set; the real-time clustering module 406 is configured to, in a case that a target text cluster matched with the target text vector is not found in the stored text cluster set, perform clustering processing on the target text meeting the clustering condition to generate a new text cluster, and add the new text cluster meeting the addition condition to the text cluster set. The label assignment module 408 is used to assign a text type label to the new text cluster. It should be noted that, in this embodiment, for one text cluster, under the condition that any one of the texts is assigned with a text type label (marking), the purpose of marking all the texts in the text cluster in batch can be achieved without repeated operations, so that the effect of improving the marking efficiency is achieved.
Optionally, in this embodiment, generating the text vector corresponding to the target text by using the target word list may be implemented by, but is not limited to, a model obtained by applying model distillation to a Bert-based language processing model. That is, the text vector generation module may include, but is not limited to, a TinyBert-based model produced by the model distillation technique. The Bert-based language processing model is a very large-scale neural network that performs unsupervised learning on massive corpus data and can be applied to natural language processing tasks such as text classification, question answering, similarity, sentiment analysis and sequence labeling. The model structure of the distilled TinyBert is the same as that of Bert, but the number of network layers is reduced, so fewer parameters participate in computation. Therefore, the model is smaller, the conversion of text into text vectors is faster, and the text recognition efficiency is further improved.
Optionally, in this embodiment, the stored text cluster set may include, but is not limited to, a plurality of object text clusters. An object text cluster may be a cluster structure formed by aggregating similar texts in a real-time text data stream. Each cluster structure has a cluster center vector, and a text vector whose distance from the cluster center vector is smaller than a threshold may be assigned to that cluster structure; that is, the text vector is determined to hit the object text cluster corresponding to the cluster structure, meaning the text is similar to the texts included in that object text cluster.
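The hit test against cluster center vectors can be sketched as follows. This is an illustration under stated assumptions: the description only says "distance", so Euclidean distance is an assumption here, and the function names `euclidean` and `match_cluster` are hypothetical. The sketch scans every center linearly for clarity; a nearest neighbor index such as HNSW, mentioned in the terms above, would replace the loop at scale.

```python
import math
from typing import Dict, Optional, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_cluster(
    vector: Sequence[float],
    centers: Dict[str, Sequence[float]],
    threshold: float,
) -> Optional[str]:
    """Return the id of the nearest cluster center closer than `threshold`.

    Returns None when no center is within the threshold, i.e. the text
    vector hits no stored object text cluster.
    """
    best_id: Optional[str] = None
    best_dist = threshold
    for cluster_id, center in centers.items():
        d = euclidean(vector, center)
        if d < best_dist:  # strictly smaller than the threshold / current best
            best_id, best_dist = cluster_id, d
    return best_id
```

For example, with centers `{"a": [0, 0], "b": [10, 10]}` and a threshold of 1.0, the vector `[0.5, 0]` hits cluster `"a"`, while `[5, 5]` hits nothing and would be handed to real-time clustering.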
Further, when a target text cluster matching the target text vector is found in the stored text cluster set, the target text hits the target text cluster in the text cluster set, and the type label of the target text can be determined from the text type label configured for the target text cluster, so that whether the target text is junk text regarded as malicious text can be identified quickly and efficiently.
It should be noted that the text cluster set may be updated periodically, where the update period may be, but is not limited to, 1 week, 1 day, or 1 hour, and the specific period value may be set according to the actual scenario. That is, by updating the text cluster set in a timely manner, texts can be identified in real time and malicious texts can be found promptly. In addition, in this embodiment, the text cluster set is updated when a new object text cluster is obtained through clustering. That is, when no target text cluster is found for the current target text in the stored text cluster set, the target text may first be cached temporarily. When the target text meets the clustering condition, a new object text cluster may be generated from it in real time, and when the adding condition is met, the new object text cluster is added to the text cluster set and the related attribute information is adjusted and updated accordingly.
Optionally, in this embodiment, a new cluster identifier may be assigned (marking) to the new object text cluster obtained through the real-time clustering process. It should be noted that, in this embodiment, the top N hot object text clusters in the current period may be, but are not limited to being, pulled from the stored text cluster set at regular intervals (e.g., every hour), and one text may be selected from each of the N object text clusters for marking, so as to determine whether the texts in that object text cluster are spam texts regarded as malicious text, where N is a positive integer.
In addition, in this embodiment, the above identification logic architecture may further include, but is not limited to, a monitoring control module configured to analyze and compile statistics on the text cluster set using the analysis functions of data analysis tools, obtain an analysis result, and display the analysis result in real time. For example, when the target text is identified as a malicious text, the content of the target text cluster where the target text is located can be tracked in real time, and a trend curve showing the change in magnitude of the target text cluster can be drawn. Furthermore, by applying an anomaly detection algorithm to the trend curve, automatic early warning information can be sent out before the malicious text breaks out, avoiding unnecessary loss to users.
The description is made with reference to the example shown in fig. 5. It is assumed that the online part comprises a logic center 500, which comprises preprocessing logic 500-2, reporting logic 500-4, and processing logic 500-6. The online part obtains a target text to be recognized (shown in the figure as a new text) and preprocesses it in the preprocessing logic 500-2 of the logic center 500. The preprocessing operations include, for example, word segmentation, stop word removal, traditional-to-simplified Chinese conversion, and special character removal, and yield a space-separated word list. The word list is then input into the text vector generation module 502, where the distilled model converts it into a target text vector. Then, the vector search module 504 searches the stored text cluster set for a target text cluster matching the target text vector, that is, determines whether the target text vector hits a cluster in the text cluster set.
When a hit is determined, that is, when the target text cluster is found, the target text cluster is located through the index list, and its configured text type label is acquired to determine the type of the target text. Further, the processing logic 500-6 in the logic center 500 processes the target text accordingly: if the target text is determined to be a malicious text, blocking or alarm processing is performed on the source account or source IP of the target text; if the target text is determined to be a non-malicious text, the target text is allowed to take part in normal interaction.
When a miss is determined, that is, when the target text cluster is not found, the real-time clustering module 506 may perform real-time clustering on the target text. When the target text meets the clustering condition, it is clustered, a new object text cluster is added to the text cluster set, and the information recorded in the index list 508 is updated. The label assignment module 510 then assigns a text type label to the new object text cluster to identify it.
The process of clustering the target text may include, but is not limited to, first determining the occurrence frequency of the target text in the device. When the occurrence frequency is greater than a threshold, the target text is clustered into a text cluster, and the text cluster is reported through the reporting logic 500-4 to the data analysis base 512 of the offline part, so that the data analysis base 512 can aggregate the new text clusters reported by each device. When the aggregation result indicates that the frequency of the new text cluster is greater than a threshold, the new text cluster is determined to be a new object text cluster to be added to the text cluster set.
During the above process, the monitoring control module 514 in the offline part performs real-time monitoring to track the change in magnitude of the malicious text and draws a trend curve showing that change. Furthermore, by applying an anomaly detection algorithm to the trend curve, automatic early warning information can be sent out before the malicious text breaks out, avoiding unnecessary loss to users.
According to the embodiment provided by the application, after the target text vector corresponding to the target text to be recognized is obtained, the periodically updated text cluster set, in which text type labels are configured, is searched for a target text cluster corresponding to the target text vector. The text type of the target text is thus recognized directly from a text cluster whose text type has already been determined, without performing separate offline recognition on each text. This improves text recognition efficiency and addresses the problem in the related art that spam texts cannot be found in time because of offline processing. In addition, recognizing malicious text quickly and efficiently helps to clean up network platforms that display or share user-generated content, avoids the interference that large amounts of illegal text cause to users while browsing or viewing, provides a healthy network environment for users, and protects their browsing safety.
As an alternative, the searching for a target text cluster matching the target text vector in the stored text cluster set includes:
S1, searching in a cluster map constructed based on the text cluster set, wherein each text vector is set as an element in the cluster map, and connecting lines are arranged among elements corresponding to the text vectors in the same object text cluster;
S2, under the condition that the target cluster center vector is found in the cluster map, determining the object text cluster indicated by the target cluster center vector as the target text cluster, wherein the distance between the target cluster center vector and the target text vector, measured in this embodiment as a cosine similarity, is greater than the target distance threshold.
Optionally, in this embodiment, the search for the target text cluster may be based on, but is not limited to, an HNSW vector retrieval technique. HNSW is a graph-based vector retrieval method provided by the related art: a graph structure is formed by constructing connecting edges between similar vectors in a vector set, which improves query speed. The method uses mechanisms such as highways, discard lists, and dynamic lists to keep queries efficient, and further accelerates retrieval by using a hierarchical graph structure on top of them. A schematic diagram of an HNSW-based vector similarity search algorithm is shown in fig. 6. In (a) of fig. 6, each hollow circle represents a vector, and the connecting lines between hollow circles indicate that the vectors are similar. Based on this graph structure, a query can be started from an entry point. Further, as shown in (b) of fig. 6, the graph structure may also be multi-level, which further improves search efficiency.
In the embodiment of the application, a corresponding graph structure is constructed for each object text cluster in the text cluster set, so that the target text cluster matching the target text vector is determined in the graph structure.
A specific example is as follows. It is assumed that an online similar-vector search service, the simsvr service, is built on HNSW so as to reduce the memory overhead, search request volume, index update synchronization time, and the like of the service. A cluster map structure for querying similar texts is constructed from the text vectors in all object text clusters in the text cluster set: each text vector is used as an element, and connecting lines are set between similar texts in the same object text cluster.
Once the target text vector of the target text is obtained, the cluster most similar to the target text vector can be searched for in the cluster map structure. That is, each object text cluster is traversed, and the cosine similarity between the cluster center vector of each object text cluster and the target text vector is compared in turn. When a target cluster center vector whose cosine similarity exceeds a target threshold is found, it is determined that the target text vector hits the existing text cluster set, the object text cluster corresponding to the target cluster center vector is determined as the target text cluster, and the cluster ID of the target text cluster is acquired. When no target cluster center vector whose cosine similarity exceeds the target threshold is found, it is determined that the target text vector does not hit the existing text cluster set, and subsequent processing continues.
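The traversal-and-compare step above can be sketched as a plain linear scan. This sketch does not implement HNSW; it only illustrates the cosine-similarity threshold test, and the function names, cluster IDs, and threshold value are assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def match_cluster(target_vec, cluster_centers, sim_threshold):
    """Return the cluster ID whose center is most similar to target_vec,
    provided the similarity exceeds sim_threshold; otherwise None."""
    best_id, best_sim = None, sim_threshold
    for cluster_id, center in cluster_centers.items():
        sim = cosine_similarity(target_vec, center)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id

# Hypothetical cluster center vectors keyed by cluster ID.
cluster_centers = {101: [1.0, 0.0], 102: [0.0, 1.0]}
hit_id = match_cluster([0.9, 0.1], cluster_centers, sim_threshold=0.8)
miss_id = match_cluster([1.0, 1.0], cluster_centers, sim_threshold=0.8)
```

With these sample values `hit_id` is 101, while `miss_id` is `None` because the similarity to both centers is about 0.707, below the threshold.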
With the embodiment provided by the application, searching in a cluster map constructed from the text clusters exploits the graph structure of the vector map to find the target text cluster matching the target text vector faster, which improves search efficiency and therefore the efficiency of identifying the target text.
As an alternative, searching in the cluster map constructed based on the text cluster set includes:
S1, determining, from the text cluster set, the candidate text clusters where each word in the target word list is located, according to the index relation recorded in the index list corresponding to the text cluster set, wherein the index relation is the mapping relation between each word in the text cluster set and the object text clusters;
S2, searching in the candidate cluster map corresponding to the candidate text clusters.
Optionally, in this embodiment, the index list may include, but is not limited to, a two-level index list comprising a primary index list (also referred to as a forward index list) and a secondary index list (also referred to as an inverted index list). The primary index list records the index relation between the cluster index identifier of each object text cluster in the text cluster set and the position pointers that give, for each word contained in that cluster, its position in the secondary index list. In other words, it is the mapping between each object text cluster and the word or words it contains. The secondary index list records the index relation between each word in the text cluster set and the cluster identifiers of the object text clusters where the word is located. In other words, it is the mapping between each word and the object text clusters where it appears.
It should be noted that the cluster index identifier may be a cluster identifier (i.e., a cluster ID) allocated to the object text cluster for use in global differentiation, or may be an identifier separately set for constructing an index list, which is not limited in this embodiment.
Optionally, in order to reduce the search amount and improve the search efficiency, in this embodiment, a screening preprocessing may be performed on the search range to determine candidate text cluster clusters with a smaller range from the text cluster set.
When the target text vector is obtained, the object text cluster where each word in the corresponding target word list is located may be found based on the index relation in an index list corresponding to the text cluster set, such as the secondary index list (also referred to as the inverted index list). For example, assume that the text cluster set comprises N object text clusters and that parsing the target word list of the target text vector yields the words in the sentences of the target text: W1, W2, W3, and W4. The inverted index list then directly shows that the object text cluster where word W1 is located is cluster B1, the one where word W2 is located is cluster B2, the one where word W3 is located is also cluster B2, and the one where word W4 is located is cluster B3. A candidate set {B1, B2, B3} can therefore be determined from the whole text cluster set as the candidate text clusters, and the search is performed in the candidate cluster map corresponding to those candidate text clusters.
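The W1 to W4 example above can be reproduced with a small sketch. The dictionary-of-sets layout and the extra entry W5 are assumptions made for illustration only.

```python
def candidate_clusters(words, inverted_index):
    """Collect the IDs of every cluster that contains at least one of the
    given words, using a word -> cluster IDs inverted index."""
    candidates = set()
    for word in words:
        candidates.update(inverted_index.get(word, ()))
    return candidates

# Inverted index mirroring the example in the text; W5/B4 is a
# hypothetical entry that the query must not pick up.
inverted_index = {
    "W1": {"B1"},
    "W2": {"B2"},
    "W3": {"B2"},
    "W4": {"B3"},
    "W5": {"B4"},
}
result = candidate_clusters(["W1", "W2", "W3", "W4"], inverted_index)
```

Here `result` is the candidate set {B1, B2, B3}; cluster B4 is never compared against the target text vector, which is where the reduction in search volume comes from.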
According to the embodiment provided by the application, determining the candidate text clusters narrows the search range, which reduces the amount of searching needed for the target text vector and thus improves search efficiency.
As an optional scheme, after searching for a target text cluster matching the target text vector in the stored text cluster set, the method further includes:
S1, under the condition that the target text cluster is not found in the text cluster set, creating a new object text cluster by using the target text vector;
S2, updating the text cluster set according to the new object text cluster.
It should be noted that, in this embodiment, when no matching text cluster is found in the existing text cluster set, real-time clustering is performed on the target text currently being identified. That is, the target text is evaluated in real time, and when the result indicates that it meets the clustering condition, it is clustered into a new object text cluster in real time. Unclustered spam texts are thereby discovered online and promptly, and maliciously triggered spam texts are consolidated and stored in the text cluster set in time, so that malicious texts can subsequently be identified directly from the text cluster set and text identification efficiency is improved.
As an alternative, updating the text cluster set according to the new object text cluster includes:
1) under the condition that the number of object text clusters currently stored in the text cluster set does not reach a target value, directly adding the new object text cluster to the text cluster set;
2) under the condition that the number of object text clusters currently stored in the text cluster set reaches the target value, removing at least one object text cluster from the text cluster set, and adding the new object text cluster to the text cluster set.
It should be noted that in this embodiment an upper limit value (i.e., the target value) is configured for the text cluster set, so that the number of clusters cannot grow without limit. This reduces the workload of traversal, search, and comparison, and improves text recognition efficiency.
The following is explained with reference to the process shown in fig. 7: assuming that a target text cluster matching the target text vector is not found in the text cluster set, the real-time clustering module 506 shown in fig. 5 may be employed to determine whether to create a new object text cluster based on the target text vector.
In step S702, a text vector q corresponding to a new text (the target text vector shown in the figure) is obtained. In step S704, a candidate set formed by the M object text clusters where the M words appearing in the text vector q are respectively located is obtained according to the inverted index list. It is then judged whether the text vector q is similar to any text cluster in the candidate set (for example, the cosine distance between the text vector q and the cluster center vector of a text cluster is taken as their similarity). If the text vector q is determined to be similar to a text cluster (the similarity is greater than a specified threshold), then in step S706-1 the text vector q is assigned to that text cluster, that is, clustered into an existing text cluster.
If the text vector q is determined not to be similar to any text cluster (the similarity is less than or equal to the specified threshold), then in step S706-2 the current number of text clusters is acquired and compared with the preset upper limit value. If the upper limit value has not been reached, a new object text cluster is created based on the text vector q in step S708-1 and added to the existing text cluster set in step S710-1. If the upper limit value has been reached, steps S708-2 and S710-2 determine an old class to be removed from the existing text clusters and remove it, after which a new object text cluster is created based on the text vector q, so that the new text cluster can be added. It should be noted that, if no old class to be removed can be determined, the current target text vector may be discarded to complete the current recognition process.
In the process of adding a new text cluster or removing an old text cluster, step S712 is further executed to update the corresponding index list. Therefore, in the next query process, the updated index list can be directly utilized to assist in completing the search and identification process of the text vectors in the text cluster set.
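The add-or-evict branch of fig. 7 can be sketched as below. The text does not say how the old class to be removed is chosen, so evicting the least recently hit cluster is purely an assumption of this sketch, as are the class and method names; the case where no old class can be determined (and the vector is discarded) is not modeled.

```python
from collections import OrderedDict

class BoundedClusterSet:
    """Cluster set with an upper limit on the number of stored clusters."""

    def __init__(self, max_clusters):
        if max_clusters < 1:
            raise ValueError("max_clusters must be at least 1")
        self.max_clusters = max_clusters
        # cluster_id -> center vector, ordered from least to most recently hit
        self.clusters = OrderedDict()

    def touch(self, cluster_id):
        # Record a hit so that this cluster is evicted last.
        self.clusters.move_to_end(cluster_id)

    def add(self, cluster_id, center):
        """Add a new cluster; return the evicted cluster ID, or None."""
        evicted = None
        if len(self.clusters) >= self.max_clusters:
            # Assumed policy: remove the least recently hit cluster.
            evicted, _ = self.clusters.popitem(last=False)
        self.clusters[cluster_id] = center
        return evicted

cluster_set = BoundedClusterSet(max_clusters=2)
first = cluster_set.add("a", [1.0, 0.0])
second = cluster_set.add("b", [0.0, 1.0])
cluster_set.touch("a")                       # "a" was hit again
evicted = cluster_set.add("c", [1.0, 1.0])   # limit reached, evicts "b"
```

A caller would pass the returned `evicted` ID on to the index update of step S712 so that the index list stays consistent with the cluster set.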
According to the embodiment provided by the application, the upper limit value (namely the target value) is set for the number of the object text cluster stored in the text cluster set, so that the situation that too many text clusters are stored in the text cluster set without limit is avoided, the workload of comparison in the process of searching and traversing is further reduced, the speed of completing traversing searching in the existing text cluster set is increased, and the effect of improving the text recognition efficiency is further achieved.
As an alternative, updating the text cluster set according to the new object text cluster includes:
when a new object text cluster is added to the text cluster set, updating a two-level index list corresponding to the text cluster set, wherein the two-level index list comprises a first-level index list and a second-level index list; the first-level index list records the cluster index identifiers corresponding to the object text clusters in the text cluster set, together with the index relation between each cluster index identifier and the position pointers giving the positions, in the second-level index list, of the words contained in the object text cluster indicated by that cluster index identifier; and the second-level index list records the index relation between each word appearing in the text cluster set and the cluster identifiers corresponding to the object text clusters where the word is located.
It should be noted that the index list of the text cluster set may include, but is not limited to, a two-level index list, which not only ensures query efficiency but also supports fast addition and deletion.
For example, as shown in fig. 8, the secondary index list records the index relation between each word and the cluster identifiers of the object text clusters where the word is located. The secondary index list may be an inverted index list. The inverted index (also called an inverted index table) is a hash table (Hash Map) data structure composed of keys (Key) and values (Value), where each value is represented as a doubly linked list (Doubly Linked List). Specifically, as shown in the upper part of fig. 8, the index cluster identifiers corresponding to word A include cluster 1, cluster 2, and cluster 5; the index cluster identifiers corresponding to word B include cluster 2 and cluster 3.
Further, as shown in fig. 8, the first-level index list (forward index list) records the index relation between each object text cluster and the position pointers giving the positions of its words in the secondary index list. The forward index is also a hash table structure, and its values are singly linked lists. The key is a cluster ID, and each node in the value linked list stores a pointer (Ptr) to a node in a value linked list of the inverted index.
It should be noted that the linked lists in the index list have the following characteristics: nodes can be modified, added, and deleted very quickly, but finding a node requires traversing the whole linked list, so lookup is slow. Unlike an ordinary singly linked list, a doubly linked list supports traversal in both directions. For the inverted index structure, it is fast to query which clusters a given word appears in, but deleting or modifying a cluster through the inverted index alone is very inefficient, because all linked list nodes corresponding to all words in the cluster must be traversed. For the forward index structure, the positions of a given cluster in the inverted index can be queried quickly, so the cluster information in the inverted index can be modified conveniently, which makes up for the weakness of the inverted index.
As an optional implementation manner, updating the two-level index list corresponding to the text cluster set includes:
1) under the condition of removing at least one object text cluster, searching a position pointer of each word corresponding to the at least one object text cluster at the position of the secondary index list in the primary index list; removing the object text cluster in which each word indicated by the position pointer is respectively positioned;
2) under the condition of adding at least one object text cluster, adding an index relation between each word appearing in at least one object text cluster and a cluster identifier corresponding to the object text cluster in which the word is positioned in a secondary index list; and creating a cluster index identifier of at least one object text clustering cluster in the primary index list, and creating an index relationship between the cluster index identifier and a position pointer of each word contained in the object text clustering cluster indicated by the cluster index identifier and the position of the secondary index list.
For example, when an old object text cluster is to be removed, the cluster index identifier of that cluster (e.g., cluster index identifier K-2) is determined, and the words contained in the text cluster identified by K-2, such as W1 and W9, can be found from the first-level index list together with their position pointers P1 and P9. The cluster identifier ID-1 corresponding to P1 and the cluster identifier ID-2 corresponding to P9 are then looked up in the secondary index list, the removal processing is executed on the text cluster identified by ID-1 and the text cluster identified by ID-2, and the related index relations recorded in the index list are deleted synchronously.
For example, in the case of adding a new object text cluster, a plurality of words included in the new object text cluster are determined, and an index relationship is created for each word in the secondary index list. And then, based on the position numbers of the positions of the words in the secondary index list, creating an index relationship between the position pointer and the cluster index identifier in the primary index list.
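The add and remove operations on the two-level index can be sketched as follows. This is a simplification under stated assumptions: Python sets stand in for the linked lists, and the words stored in the forward index stand in for the position pointers into the inverted index. The class and method names are hypothetical.

```python
class TwoLevelIndex:
    """Forward index (cluster -> words) plus inverted index (word -> clusters)."""

    def __init__(self):
        self.inverted = {}  # secondary index: word -> set of cluster IDs
        self.forward = {}   # primary index: cluster ID -> set of words

    def add_cluster(self, cluster_id, words):
        self.remove_cluster(cluster_id)  # drop stale entries if re-adding
        unique_words = set(words)
        for word in unique_words:
            self.inverted.setdefault(word, set()).add(cluster_id)
        self.forward[cluster_id] = unique_words

    def remove_cluster(self, cluster_id):
        # The forward index tells us exactly which inverted entries to
        # touch, so the whole inverted index never has to be scanned.
        for word in self.forward.pop(cluster_id, ()):
            postings = self.inverted[word]
            postings.discard(cluster_id)
            if not postings:
                del self.inverted[word]

    def clusters_for(self, word):
        return set(self.inverted.get(word, ()))

index = TwoLevelIndex()
index.add_cluster(1, ["A"])
index.add_cluster(2, ["A", "B"])
index.add_cluster(3, ["B"])
index.add_cluster(5, ["A"])
before = index.clusters_for("A")
index.remove_cluster(2)
after_a = index.clusters_for("A")
after_b = index.clusters_for("B")
```

The sample data mirrors fig. 8 as described above: word A maps to clusters 1, 2, and 5, and word B to clusters 2 and 3. After cluster 2 is removed, only the entries for A and B are visited.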
It should be noted that the cluster index identifier (Key, abbreviated as K) may be a cluster identifier (i.e., a cluster ID) allocated to the object text cluster for global differentiated use, or may be a global identifier allocated to construct an index list separately, which is not limited in this embodiment.
By the embodiment provided by the application, the object text cluster is managed based on the two-level index list, such as query, addition, deletion and other operations, and the management efficiency is greatly improved.
As an optional scheme, before adding the new object text cluster to the text cluster set, the method includes:
S1, reporting the new object text cluster to a server under the condition that the occurrence frequency of the new object text cluster is greater than a first threshold value, so that the server can aggregate the new object text clusters reported by each clustering device;
S2, under the condition that the aggregation result of the server indicates that the cluster size of the new object text cluster is larger than a second threshold value, allocating a cluster identifier to the new object text cluster.
Optionally, in this embodiment, a clustering condition must be met before a new object text cluster is created based on the target text vector. The clustering condition may include, but is not limited to, two rounds of clustering determination each reaching its corresponding threshold condition. It should be noted that, during clustering, the online traffic is actually distributed across a plurality of clustering devices. To keep cluster numbering and statistics uniform, in this embodiment the single-device clustering results reported by each clustering device may be, but are not limited to being, summarized, and the latest clustering results are updated into the vector search service.
For example, as shown in fig. 9, when a plurality of devices (three clustering devices are assumed here) perform text recognition in parallel, the clustering process of step S902 may be performed in parallel in each device: each clustering device obtains text vectors and reports to the aggregation device the text vector clusters whose occurrence frequency in that device is greater than a first threshold (shown in the figure as the low threshold). For example, the clustering devices execute steps S904-1, S904-2, and S904-3 respectively (i.e., low-threshold-triggered reporting).
The aggregation device executes step S906 to aggregate the reported content (i.e., secondary clustering). After the cluster size of the current secondary cluster reaches the second threshold (shown in the figure as the high threshold), the result of the secondary clustering is reported to the offline database 902 in step S908 (i.e., high-threshold-triggered reporting). The database 902 includes a cluster report table 9022 and a global cluster table 9024. The cluster report table 9022 is used for cluster control; for example, in step S910 the clustering results are fused to obtain a cluster. The global cluster table 9024 is used to allocate a new global ID to the new object text cluster: as shown in steps S912 to S914, the existing clusters are requested for comparison, and the new cluster is added to the global cluster table 9024. The table is then synchronized to the online vector search service 900 at regular intervals (step S916, periodic update), so that the vector search service 900 can quickly find the latest target text cluster matching the target text vector.
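The low-threshold and high-threshold reporting of fig. 9 can be sketched as a two-stage filter. The sketch assumes that every device uses the same key for the same cluster (for example a hash of a representative text), whereas the embodiment merges clusters by secondary clustering; the function names, counts, and thresholds are hypothetical.

```python
from collections import Counter

def device_report(local_counts, low_threshold):
    """Stage 1: a device reports only clusters seen more often than the
    low threshold."""
    return {key: n for key, n in local_counts.items() if n > low_threshold}

def aggregate(reports, high_threshold):
    """Stage 2: sum the reported sizes across devices and keep clusters
    whose aggregated size exceeds the high threshold."""
    totals = Counter()
    for report in reports:
        totals.update(report)
    return {key: n for key, n in totals.items() if n > high_threshold}

# Per-device occurrence counts for three clustering devices.
devices = [
    {"x": 5, "y": 1},
    {"x": 4, "y": 2},
    {"x": 6, "z": 3},
]
reports = [device_report(counts, low_threshold=2) for counts in devices]
accepted = aggregate(reports, high_threshold=10)

# Assign a global ID only to clusters that passed both thresholds.
global_ids = {key: gid for gid, key in enumerate(sorted(accepted), start=1)}
```

In this sample, cluster `y` never leaves its devices, cluster `z` is reported but stays below the high threshold, and only cluster `x` (aggregated size 15) receives a global ID.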
With the embodiment provided by the application, text vectors meeting the clustering condition are clustered in real time, and the clusters reported by different devices are filtered and managed through secondary clustering. This avoids allocating global identifiers to, or occupying global resources with, clusters that occur infrequently or are small, and reduces unnecessary waste of resources.
As an optional scheme, the obtaining of the target word list corresponding to the target text to be recognized includes: performing preprocessing operation on the target text to obtain a target word list, wherein the preprocessing operation comprises the following steps: word segmentation operation, redundant character removal operation and format conversion operation; generating a target text vector corresponding to the target text by using the target word list comprises: and converting the target text into a target text vector by adopting a distilled language vector conversion model, wherein the number of network layers in the distilled language vector conversion model is smaller than that in the original language vector conversion model, and the language vector conversion model is a coding representation model for language processing.
Optionally, in this embodiment, TinyBert obtained by distilling the Bert language processing model may be used, but is not limited to being used, to convert text into text vectors. The model structure of TinyBert is the same as that of Bert, except that it has fewer network layers and fewer parameters. Fig. 10 shows the model structure of Bert, where (a) in fig. 10 is the Transformer structure and (b) in fig. 10 is the Bert structure.
After the language processing model has been trained with sample data, text vectors can be generated as follows: a preprocessing operation is performed on the text, for example word segmentation, stop word removal, traditional-to-simplified Chinese conversion, and special character removal, to obtain a space-separated word list; the word list is converted into numeric IDs and input into the online model; and the model outputs the text vector corresponding to the text.
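The preprocessing and ID conversion steps can be sketched as below. The sketch is deliberately simplified: whitespace splitting stands in for Chinese word segmentation, traditional-to-simplified conversion is omitted, and the stop word list, vocabulary, and unknown-word ID are hypothetical. The final step of feeding the IDs to the distilled model is not shown.

```python
import re

# Hypothetical stop word list and vocabulary, for illustration only.
STOP_WORDS = {"the", "a", "is"}
VOCAB = {"win": 1, "free": 2, "prize": 3, "click": 4}
UNK_ID = 0  # ID used for words missing from the vocabulary

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip special characters, split into words, and drop
    stop words. Whitespace splitting stands in for word segmentation."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [token for token in cleaned.split() if token not in stop_words]

def to_ids(words, vocab=VOCAB, unk_id=UNK_ID):
    """Map each word to its numeric ID before it is fed to the model."""
    return [vocab.get(word, unk_id) for word in words]

words = preprocess("Win a FREE prize!!! Click the link.")
word_list = " ".join(words)   # the space-separated word list
ids = to_ids(words)
```

For the sample sentence, `word_list` is "win free prize click link" and `ids` is [1, 2, 3, 4, 0], the last entry being the unknown-word ID for "link".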
According to the embodiment provided by the application, the text vector is generated by using the Bert-based language processing model and the model distillation technology, so that the text conversion processing process is accelerated by using the model with fewer network layers, the text type of the text can be rapidly identified based on the converted text vector, and the effect of improving the text identification efficiency is achieved.
As an optional scheme, after determining the target text as the abnormal text, the method further includes:
S1, periodically pulling, from the text cluster set, a plurality of object text clusters that are in a hot state in a target time period;
S2, configuring text type labels for the plurality of object text clusters.
It should be noted that, after the text cluster set is updated, text type labels may be assigned (i.e., marked) periodically to the object text clusters in it. For example, the top N hot object text clusters in the current time period (e.g., 1 hour) are pulled on a schedule (e.g., every hour) and imported into a marking system, where each cluster is marked. Based on the actual operation result, it is determined whether each of the top N hot object text clusters is maliciously triggered spam text. If it is spam text, the text type label configured for the cluster is: malicious cluster; if it is not spam text, the text type label configured for the cluster is: non-malicious cluster. The marking results are then submitted and fed back online promptly, so that clustering remains real-time.
In addition, in this embodiment, the marking operation may be, but is not limited to, configuring a text type label for any one text in an object text cluster, so that all texts in that object text cluster share the label and the marking operation need not be repeated.
Through the embodiment provided by the application, the marking results of online real-time clustering are updated in real time, which speeds up the identification of maliciously triggered spam text (abnormal text) and makes it easier to crack down on such text effectively.
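As a rough illustration of the periodic marking flow, the sketch below pulls the top N clusters by hit count in the current window and applies one label per cluster. The `Cluster` fields, function names, and hit counts are hypothetical; the source does not specify how "hot" is measured.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cluster:
    cluster_id: int
    texts: List[str] = field(default_factory=list)
    hits_in_window: int = 0      # hypothetical hotness measure: matches in the current period
    label: Optional[str] = None  # "malicious", "non-malicious", or None if not yet marked


def pull_hot_clusters(clusters, top_n):
    """Return the top_n clusters with the most hits in the current time period."""
    return sorted(clusters, key=lambda c: c.hits_in_window, reverse=True)[:top_n]


def apply_label(cluster, is_spam):
    """One marking decision labels the whole cluster; every member text shares it."""
    cluster.label = "malicious" if is_spam else "non-malicious"


clusters = [
    Cluster(1, ["Need a loan?"] * 3, hits_in_window=120),
    Cluster(2, ["Great content!"], hits_in_window=5),
    Cluster(3, ["Very creative!"], hits_in_window=40),
]
hot = pull_hot_clusters(clusters, top_n=2)
apply_label(hot[0], is_spam=True)
```

Because the label lives on the cluster rather than on each text, marking one text is enough to classify every text that later matches the same cluster.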
As an optional scheme, after determining the target text as the abnormal text, the method further includes: executing at least one of the following operations on the source account or the source IP address which sends the abnormal text: sending alarm information and terminating the use authority.
It should be noted that, in this embodiment, a penalty may further be applied for the identified abnormal text (i.e., maliciously triggered spam text). For example, alarm information is sent to the source account or source IP address that triggered the abnormal text, to notify it of the abnormal state; or the source account or source IP address that triggered the abnormal text is banned, that is, its usage permission is terminated, so that abnormal text cannot be maliciously triggered again and cause information interference to users.
As an optional scheme, after determining the target text as the abnormal text, the method further includes:
S1, acquiring clustering information of the abnormal texts identified in the target period, wherein the clustering information at least comprises magnitude change information of the abnormal texts;
S2, pushing the clustering information to each client in chart form for display.
Optionally, in this embodiment, the data analysis functions of a mature data analysis tool (e.g., Clickhouse) may be used to analyze the content of the identified abnormal text and display the analysis results in real time. For example, the current text clustering status is displayed (showing the currently most popular text content), and a magnitude trend curve is drawn for each suspicious cluster. Further, for clusters whose magnitude changes abnormally, automatic early warning can be performed according to a curve anomaly detection algorithm. For malicious clusters found in the monitoring system, operators can act on them directly within the monitoring system.
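The source does not name the curve anomaly detection algorithm. As one simple stand-in, the sketch below flags a cluster whose latest size lies far above its recent history using a z-score; the function name, the limit of 3.0, and the sample sizes are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def is_abnormal_growth(history, latest, z_limit=3.0):
    """Flag `latest` if it is more than z_limit standard deviations above the history mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any increase counts as abnormal in this sketch.
        return latest > mu
    return (latest - mu) / sigma > z_limit


hourly_sizes = [10, 12, 11, 9, 13]  # hypothetical cluster sizes for the previous hours
steady = is_abnormal_growth(hourly_sizes, 12)
spike = is_abnormal_growth(hourly_sizes, 80)
```

A production system would more likely use a seasonal or robust detector, but the early-warning decision has the same shape: compare the newest point of the magnitude curve against what the curve's history predicts.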
Through the embodiment provided by the application, under the condition that the abnormal text is timely found and recognized, the real-time monitoring result of the abnormal text is further improved, early warning is carried out on the malicious cluster which is timely found, and unnecessary loss is avoided.
The description is made with specific reference to the following examples:
Take a video content sharing platform as an example. Suppose user account ID-A shares a video clip through the platform, and fan accounts ID-a to ID-c, which follow user account ID-A, each post comment texts on the video clip. Suppose fan account ID-a posts comment text a: "The content is great!"; fan account ID-b posts comment text b: "Very creative!"; and fan account ID-c posts multiple repeated texts c: "Do you need a high-value loan?". Here, the content posted by fan account ID-c is repetitive; it does not help user account ID-A obtain feedback on the posted video, nor does it help other user accounts when viewing. In fact, it causes information interference to user account ID-A and other user accounts.
By the text recognition method provided in this embodiment, spam texts in the text a, the text b, and the text c can be recognized by comparing and searching the stored text cluster set. That is, the text c can be accurately recognized as an abnormal text (i.e., spam text). Under the condition that the operation frequency of the fan account ID-c for issuing the similar abnormal text reaches the threshold value, the alarm information can be directly sent to the fan account ID-c by the background server or the use permission of the fan account ID-c is stopped, so that the purpose of safety protection of the browsing environment of a large number of users is achieved.
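The threshold-based penalty in this example can be sketched as below. The source mentions a single frequency threshold and two possible actions; using two separate thresholds, and the names and values chosen here, are assumptions made for illustration.

```python
from collections import Counter

ALARM_THRESHOLD = 2  # hypothetical: abnormal texts before alarm information is sent
BAN_THRESHOLD = 3    # hypothetical: abnormal texts before usage permission is terminated

abnormal_counts = Counter()


def record_abnormal_text(source):
    """Count one abnormal text for a source (account ID or IP address) and pick an action."""
    abnormal_counts[source] += 1
    n = abnormal_counts[source]
    if n >= BAN_THRESHOLD:
        return "terminate_permission"
    if n >= ALARM_THRESHOLD:
        return "send_alarm"
    return "none"


# Fan account ID-c posts the same abnormal text three times.
actions = [record_abnormal_text("ID-c") for _ in range(3)]
```

Keying the counter by source means the same logic applies whether the source is an account or an IP address.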
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiment of the invention, a text recognition device for implementing the text recognition method is also provided. As shown in fig. 11, the apparatus includes:
1) an obtaining unit 1102, configured to obtain a target word list corresponding to a target text to be identified;
2) a generating unit 1104, configured to generate a target text vector corresponding to the target text by using the target word list;
3) a searching unit 1106, configured to search a target text cluster matched with a target text vector in a stored text cluster set, where the text cluster set is to be updated periodically, and the text cluster set includes an object text cluster configured with a text type tag, and the text type tag is used to indicate that the object text cluster is a malicious cluster or a non-malicious cluster;
4) the identifying unit 1108 is configured to find a target text cluster in the text cluster set, and determine the target text as an abnormal text when a text type tag corresponding to the target text cluster indicates that the target text cluster is a malicious cluster.
Optionally, in this embodiment, the text recognition apparatus may be applied, without limitation, in application scenarios where content is presented or promoted based on User Generated Content (UGC), where UGC refers to original content that users present or provide to other users through an internet platform, such as friend social networks, video/photo/knowledge sharing networks, and communities/forums. Optionally, the text information in the application scenarios to which the text recognition method provided in this embodiment is applied may include, but is not limited to, at least one of the following: forwarded text information in an external-link forwarding scenario, shared text information in a community shared space, comment text information on displayed content, and greeting text information used when establishing an account binding relationship (such as registration, adding a friend account, and the like). The identification process mainly identifies spam texts that users publish through the network platform in different application scenarios and that occur frequently with repeated content, such as text information sent by users in "drift bottle" form, which may have repeated content, and even text content involving illegal or sensitive information (such as pornography, gambling, or drug-related information). Such spam texts are malicious texts triggered by malicious actors to achieve some purpose. They cause considerable information interference to the many users who follow content such as web pages or videos during normal browsing, so that a safe network browsing environment cannot be guaranteed for those users.
For specific embodiments, reference may be made to the above method embodiments, which are not described herein again.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device for implementing the text recognition method, where the electronic device may be a server shown in fig. 1. As shown in fig. 12, the electronic device comprises a memory 1202 and a processor 1204, the memory 1202 having stored therein a computer program, the processor 1204 being arranged to perform the steps of any of the above-described method embodiments by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring a target word list corresponding to the target text to be recognized;
S2, generating a target text vector corresponding to the target text by using the target word list;
S3, searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated periodically, and the text cluster set comprises an object text cluster configured with a text type label, and the text type label is used for indicating that the object text cluster is a malicious cluster or a non-malicious cluster;
S4, under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster, determining the target text as an abnormal text.
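Steps S1 to S4 can be sketched end to end as follows. Everything here is a simplified stand-in: a bag-of-words `Counter` replaces the distilled language model, a linear scan replaces the cluster graph and index, and "matched" is assumed to mean cosine similarity at or above a hypothetical threshold.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.8  # hypothetical matching threshold


def word_list(text):
    """S1: a stand-in for preprocessing; lowercase and split on whitespace."""
    return text.lower().split()


def text_vector(words):
    """S2: a toy bag-of-words vector standing in for the language model output."""
    return Counter(words)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def find_cluster(vector, cluster_set):
    """S3: return the best-matching cluster if its center is similar enough, else None."""
    best = max(cluster_set, key=lambda c: cosine(vector, c["center"]), default=None)
    if best is not None and cosine(vector, best["center"]) >= SIMILARITY_THRESHOLD:
        return best
    return None


def is_abnormal(text, cluster_set):
    """S4: abnormal only if a matching cluster exists and it is labeled malicious."""
    cluster = find_cluster(text_vector(word_list(text)), cluster_set)
    return cluster is not None and cluster["label"] == "malicious"


cluster_set = [
    {"center": text_vector(word_list("need a high loan")), "label": "malicious"},
    {"center": text_vector(word_list("very creative video")), "label": "non-malicious"},
]
```

With this cluster set, `is_abnormal("Need a high loan", cluster_set)` matches the malicious cluster, while a text that matches only the non-malicious cluster, or no cluster at all, is not treated as abnormal.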
Optionally, those skilled in the art will understand that the structure shown in fig. 12 is only illustrative, and the electronic device may also be a terminal device such as a smartphone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, and the like. Fig. 12 does not limit the structure of the electronic device. For example, the electronic device may include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 12, or have a different configuration from that shown in fig. 12.
The memory 1202 may be used to store software programs and modules, such as program instructions/modules corresponding to the text recognition method and apparatus in the embodiments of the present invention, and the processor 1204 executes various functional applications and data processing by running the software programs and modules stored in the memory 1202, that is, implementing the text recognition method described above. The memory 1202 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 1202 can further include memory located remotely from the processor 1204, which can be connected to a terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 1202 may be, but is not limited to being, specifically configured to store information such as the target text to be recognized and the recognition result. As an example, as shown in fig. 12, the memory 1202 may include, but is not limited to, the obtaining unit 1102, the generating unit 1104, the searching unit 1106, and the identifying unit 1108 of the text recognition device. In addition, other module units of the text recognition apparatus may also be included, which are not described in detail in this example.
Optionally, the transmitting device 1206 is configured to receive or transmit data via a network. Examples of the network may include a wired network and a wireless network. In one example, the transmitting device 1206 includes a Network Interface Controller (NIC) that can be connected to a router via a network cable to communicate with the internet or a local area network. In one example, the transmitting device 1206 is a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
In addition, the electronic device further includes: a display 1208, configured to display the target text to be recognized and the recognition result; and a connection bus 1210 for connecting the respective module parts in the above-described electronic apparatus.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting a plurality of nodes through a network communication. Nodes can form a Peer-To-Peer (P2P, Peer To Peer) network, and any type of computing device, such as a server, a terminal, and other electronic devices, can become a node in the blockchain system by joining the Peer-To-Peer network.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text recognition method described above. Wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring a target word list corresponding to the target text to be recognized;
S2, generating a target text vector corresponding to the target text by using the target word list;
S3, searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated periodically, and the text cluster set comprises an object text cluster configured with a text type label, and the text type label is used for indicating that the object text cluster is a malicious cluster or a non-malicious cluster;
S4, under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster, determining the target text as an abnormal text.
Alternatively, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (15)

1. A text recognition method, comprising:
acquiring a target word list corresponding to a target text to be recognized;
generating a target text vector corresponding to the target text by using the target word list;
searching a target text cluster matched with the target text vector in a stored text cluster set, wherein the text cluster set is updated periodically, and the text cluster set comprises an object text cluster configured with a text type label, and the text type label is used for indicating that the object text cluster is a malicious cluster or a non-malicious cluster;
and determining the target text as an abnormal text under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster.
2. The method of claim 1, wherein the finding a target text cluster in the set of stored text clusters that matches the target text vector comprises:
searching a cluster map constructed based on the text cluster set, wherein each text vector is set as an element in the cluster map, and connecting lines are arranged among elements corresponding to the text vectors in the same object text cluster;
and under the condition that a target clustering center vector is found in the clustering graph, determining an object text clustering cluster indicated by the target clustering center vector as the target text clustering cluster, wherein the distance between the target clustering center vector and the target text vector is greater than a target distance threshold.
3. The method of claim 2, wherein finding the target text vector in a cluster map constructed based on the set of text clusters comprises:
determining candidate text cluster clusters where all words in the target word list are located from the text cluster set according to index relations recorded in an index list corresponding to the text cluster set, wherein the index relations are mapping relations between all the words in the text cluster set and object text cluster clusters;
and searching the candidate cluster map corresponding to the candidate text cluster.
4. The method of claim 1, further comprising, after finding a target text cluster matching the target text vector in the set of stored text clusters:
under the condition that the target text cluster is not found in the text cluster set, creating a new object text cluster by using the target text vector;
and updating the text cluster set according to the new object text cluster.
5. The method of claim 4, wherein updating the set of text cluster clusters based on the new object text cluster comprises:
under the condition that the number of the currently stored object text cluster clusters in the text cluster set does not reach a target numerical value, directly adding the new object text cluster clusters into the text cluster set;
and under the condition that the number of the currently stored object text cluster clusters in the text cluster set reaches the target numerical value, removing at least one object text cluster from the text cluster set, and then adding the new object text cluster to the text cluster set.
6. The method of claim 5, wherein updating the set of text cluster clusters based on the new object text cluster comprises:
when the new object text cluster is added to the text cluster set, updating a two-level index list corresponding to the text cluster set, wherein the two-level index list comprises a first-level index list and a second-level index list, the first-level index list records cluster index identifications corresponding to all object text clusters in the text cluster set, index relationships between position pointers of all words contained in the object text clusters indicated by the cluster index identifications and the positions of the second-level index list, and the second-level index list records index relationships between all words appearing in the text cluster set and cluster identifications corresponding to the object text cluster where the words are located.
7. The method of claim 6, wherein updating the two-level index list corresponding to the set of text clusters comprises:
under the condition of eliminating at least one object text cluster, searching a position pointer of each word corresponding to the at least one object text cluster at the position of the secondary index list in the primary index list; removing the object text cluster in which each word indicated by the position pointer is respectively positioned;
under the condition of adding at least one object text cluster, adding an index relation between each word appearing in the at least one object text cluster and a cluster identifier corresponding to the object text cluster in which the word is positioned in the secondary index list; and creating a cluster index identifier of the at least one object text clustering cluster in the primary index list, and establishing an index relationship between the cluster index identifier and a position pointer of each word contained in the object text clustering cluster indicated by the cluster index identifier and the position of the secondary index list.
8. The method of claim 5, wherein prior to adding the new object text cluster to the set of text clusters, comprising:
reporting the new object text cluster to a server under the condition that the occurrence frequency of the new object text cluster is greater than a first threshold value, so that the server can gather the new object text cluster reported by each clustering device;
and under the condition that the convergence result of the server indicates that the cluster size of the new object text cluster is larger than a second threshold value, distributing cluster identification to the new object text cluster.
9. The method according to any one of claims 1 to 8,
the acquiring of the target word list corresponding to the target text to be recognized includes: performing a preprocessing operation on the target text to obtain the target word list, wherein the preprocessing operation comprises: word segmentation operation, redundant character removal operation and format conversion operation;
the generating a target text vector corresponding to the target text by using the target word list includes: converting the target text into the target text vector by adopting a distilled language vector conversion model, wherein the number of network layers in the distilled language vector conversion model is smaller than that in an original language vector conversion model, and the language vector conversion model is a coding expression model for language processing.
10. The method according to any one of claims 1 to 8, further comprising, after determining the target text as an abnormal text:
periodically pulling a plurality of object text cluster clusters in a hot state in a target time period from the text cluster set;
and configuring the text type labels for the plurality of object text clustering clusters.
11. The method according to any one of claims 1 to 8, further comprising, after determining the target text as an abnormal text:
executing at least one of the following operations on the source account or the source IP address which sends the abnormal text: sending alarm information and terminating the use authority.
12. The method according to any one of claims 1 to 8, further comprising, after determining the target text as an abnormal text:
acquiring clustering information of the abnormal texts identified in a target period, wherein the clustering information at least comprises magnitude change information of the abnormal texts;
and pushing the clustering information to each client side in a chart form for displaying.
13. A text recognition apparatus, comprising:
the acquisition unit is used for acquiring a target word list corresponding to a target text to be recognized;
the generating unit is used for generating a target text vector corresponding to the target text by using the target word list;
a searching unit, configured to search a target text cluster matched with the target text vector in a stored text cluster set, where the text cluster set is to be updated periodically, and the text cluster set includes an object text cluster configured with a text type tag, where the text type tag is used to indicate that the object text cluster is a malicious cluster or a non-malicious cluster;
and the identification unit is used for determining the target text as the abnormal text under the condition that the target text cluster is found in the text cluster set and the text type label corresponding to the target text cluster indicates that the target text cluster is a malicious cluster.
14. A computer-readable storage medium, comprising a stored program, wherein the program when executed performs the method of any of claims 1 to 12.
15. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 12 by means of the computer program.
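For illustration only, the capacity-limited cluster set and two-level index recited in claims 5 to 7 (and the word-based candidate lookup of claim 3) could be sketched as below. The class and method names are hypothetical, and evicting the oldest cluster is an assumption, since the claims only require that at least one cluster be removed.

```python
from collections import OrderedDict, defaultdict


class ClusterIndex:
    def __init__(self, capacity):
        self.capacity = capacity           # the "target numerical value"
        self.primary = OrderedDict()       # cluster ID -> words contained in that cluster
        self.secondary = defaultdict(set)  # word -> IDs of the clusters containing it

    def add(self, cluster_id, words):
        """Add a cluster, evicting the oldest one first if the set is full (assumed policy)."""
        if len(self.primary) >= self.capacity:
            self.remove(next(iter(self.primary)))
        self.primary[cluster_id] = set(words)
        for word in words:
            self.secondary[word].add(cluster_id)

    def remove(self, cluster_id):
        """Use the primary index to locate and clear the cluster's secondary entries."""
        for word in self.primary.pop(cluster_id):
            self.secondary[word].discard(cluster_id)
            if not self.secondary[word]:
                del self.secondary[word]

    def candidates(self, words):
        """Return the IDs of clusters that share at least one word with the query."""
        found = set()
        for word in words:
            found |= self.secondary.get(word, set())
        return found


index = ClusterIndex(capacity=2)
index.add(1, ["need", "loan"])
index.add(2, ["great", "content"])
index.add(3, ["loan", "offer"])  # the set is full, so cluster 1 is evicted first
```

The primary index makes removal cheap, because the words of an evicted cluster are known without scanning the whole secondary index; the secondary index narrows a search to clusters that share vocabulary with the target word list.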
CN202011255568.XA 2020-11-11 2020-11-11 Text recognition method and device, storage medium and electronic equipment Pending CN112256880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011255568.XA CN112256880A (en) 2020-11-11 2020-11-11 Text recognition method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011255568.XA CN112256880A (en) 2020-11-11 2020-11-11 Text recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112256880A true CN112256880A (en) 2021-01-22

Family

ID=74265519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011255568.XA Pending CN112256880A (en) 2020-11-11 2020-11-11 Text recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112256880A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883190A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN113776537A (en) * 2021-09-07 2021-12-10 山东大学 De-centralization multi-agent navigation method and system in unmarked complex scene
WO2023058099A1 (en) * 2021-10-04 2023-04-13 富士通株式会社 Processing method, processing program, and information processing device
CN117688139A (en) * 2024-02-01 2024-03-12 中国信息通信研究院 Text searching method and device based on industrial Internet identification in blockchain

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883190A (en) * 2021-01-28 2021-06-01 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN113776537A (en) * 2021-09-07 2021-12-10 山东大学 De-centralization multi-agent navigation method and system in unmarked complex scene
CN113776537B (en) * 2021-09-07 2024-01-19 山东大学 De-centralized multi-agent navigation method and system in unmarked complex scene
WO2023058099A1 (en) * 2021-10-04 2023-04-13 富士通株式会社 Processing method, processing program, and information processing device
CN117688139A (en) * 2024-02-01 2024-03-12 中国信息通信研究院 Text searching method and device based on industrial Internet identification in blockchain

Similar Documents

Publication Publication Date Title
CN111782965B (en) Intention recommendation method, device, equipment and storage medium
CN112256880A (en) Text recognition method and device, storage medium and electronic equipment
Edwards et al. A systematic survey of online data mining technology intended for law enforcement
CN116157790A (en) Document processing and response generation system
CN109905288B (en) Application service classification method and device
CN112468347B (en) Security management method and device for cloud platform, electronic equipment and storage medium
CN105631749A (en) User portrait calculation method based on statistical data
CN109376288B (en) Cloud computing platform for realizing semantic search and balancing method thereof
CN112000889A (en) Information gathering and presenting system
CN110399564B (en) Account classification method and device, storage medium and electronic device
CN104636386A (en) Information monitoring method and device
Kavitha et al. Discovering public opinions by performing sentimental analysis on real time Twitter data
CN105354343B (en) User characteristics method for digging based on remote dialogue
CN112307318A (en) Content publishing method, system and device
KR20220074574A (en) A method and an apparatus for analyzing real-time chat content of live stream
CN113495945A (en) Text search method, text search device and storage medium
CN115296892B (en) Data information service system
López-Ramírez et al. Geographical aggregation of microblog posts for LDA topic modeling
Yang et al. Deep learning-based reverse method of binary protocol
Barrero et al. Adapting searchy to extract data using evolved wrappers
CN115114519A (en) Artificial intelligence based recommendation method and device, electronic equipment and storage medium
Murthy et al. TwitSenti: a real-time Twitter sentiment analysis and visualization framework
CN114328818A (en) Text corpus processing method and device, storage medium and electronic equipment
Zhang et al. Event-radar: Real-time local event detection system for geo-tagged tweet streams
CN112231700A (en) Behavior recognition method and apparatus, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination