CN112052676A

CN112052676A - Text content processing method, computer equipment and storage medium

Info

Publication number: CN112052676A
Application number: CN202010897035.5A
Authority: CN
Inventors: 郭芳; 于云成; 王炳功; 於雪松; 于志鹏; 姜乃榕; 刘子正; 秦冲; 张巍; 王晓燕; 沙鑫; 车晨; 滕建港; 张英; 张玉苗; 张雪玮; 滕瑶琪; 陈林; 邹承志
Original assignee: Rongcheng Power Supply Co Of State Grid Shandong Electric Power Co
Current assignee: Rongcheng Power Supply Co Of State Grid Shandong Electric Power Co
Priority date: 2020-08-31
Filing date: 2020-08-31
Publication date: 2020-12-08
Anticipated expiration: 2040-08-31
Also published as: CN112052676B

Abstract

The invention discloses a text content processing method, computer equipment and a storage medium. The method comprises the following steps: acquiring a size parameter of a target text; performing word segmentation processing on the target file according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings; and sending out prompt information when a target character string is determined to have a preset character string. With this method, different word segmentation processing can be applied to the character strings in an electronic file according to the size of the file, and weak sensitive words are determined by a corresponding method on the target character string set formed after word segmentation, so that the application of the electronic file is not affected by omitted weak sensitive words, and the query speed for weak sensitive words is improved.

Description

Text content processing method, computer equipment and storage medium

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a method for processing text content, a computer device, and a storage medium.

Background

At present, sensitive words often appear in electronic documents. They are generally divided into strong sensitive words and weak sensitive words. Strong sensitive words are sensitive words that must be detected and avoided in electronic documents, while weak sensitive words are sensitive words that should be detected as far as possible; although they are sensitive, their influence is limited as long as they are not widely spread. In the prior art, strong sensitive words are strictly monitored, but because the word segmentation dictionary is fixed and the context of a sensitive word affects segmentation, weak sensitive words cannot be accurately segmented and monitored, which in turn affects the application of electronic files.

Disclosure of Invention

In order to solve the problems in the prior art, so that different word segmentation processing methods can be applied to the character strings in an electronic file according to the size of the file, weak sensitive words can be determined by a corresponding method on the target character string set formed after word segmentation, the application of the electronic file is not affected by omitted weak sensitive words, and the query speed for weak sensitive words is improved, the embodiments of the invention provide a text content processing method, computer equipment and a storage medium. The technical scheme is as follows:

in one aspect, a method for processing text content, the method comprising the steps of:

acquiring a size parameter of a target text;

performing word segmentation processing on the target file according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings;

and sending prompt information when the target character string is determined to have the preset character string.

In another aspect, a computer device includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the processing method as described above.

In another aspect, a computer-readable storage medium stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the processing method as described above.

The processing method, the computer equipment and the storage medium of the text content have the following technical effects:

based on the technical scheme of the invention, different word segmentation processing methods are adopted for character strings in the electronic file according to the size of the file in the electronic file to obtain different target character string sets, and different methods are adopted for different character string sets to determine weak sensitive words; therefore, according to the technical scheme, an appropriate word segmentation processing method is selected based on the file size of the text, weak sensitive words are determined in the target character string set generated based on the word segmentation mode, the phenomenon that the word segmentation accuracy of the weak sensitive words is influenced by the word segmentation method and the context is avoided, the influence of the sensitive words on the application of the electronic file is further avoided, and the query speed of the sensitive words is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic flowchart of a text content processing method according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The method for processing text content provided by the embodiment of the invention can be applied to any computer equipment with data processing capability. The computer equipment can be a terminal or a server, and when executing the method for processing text content provided by the embodiment of the invention, the computer equipment can execute it independently or in cluster cooperation.

The present embodiment provides a method for processing text content, and fig. 1 is a flowchart of the method provided by the present embodiment. The present specification provides the method operation steps as described in the embodiments or the flowchart, but more or fewer operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. In practice, the system or server product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. Specifically, as shown in fig. 1, the method may include the steps of:

s101, acquiring a size parameter of a target text;

specifically, the size parameter of the target text represents the data amount of the target text, wherein the data amount unit may be Byte (B, Byte), Kilobyte (KB), Megabyte (MB), Gigabyte (GB), or the like.

In the present embodiment, the target text refers to a document applied to a printing apparatus, such as a manuscript, a paper, or promotional material; this is not limited in the present embodiment.

S103, performing word segmentation processing on the target file according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings;

specifically, the target character string set refers to character strings generated by arranging a plurality of target character strings according to a preset sequence, wherein the target character strings include a first target character string and a second target character string, and only one of the first target character string and the second target character string is a target character string generated for the same target text.

Furthermore, the preset sequence arranges the target character strings by length from longest to shortest, which facilitates the comparison between the target character strings and the sensitive words in the sensitive word library and improves the query speed for sensitive words.
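The length-based ordering can be illustrated with a minimal Python sketch. The function name `order_by_length` is hypothetical and does not come from the source; the sketch only assumes that strings of equal length keep their original relative order.

```python
from typing import List


def order_by_length(target_strings: List[str]) -> List[str]:
    """Arrange target strings from longest to shortest.

    Python's sort is stable, so strings of equal length keep their
    original relative order (an assumption; the source is silent on ties).
    """
    return sorted(target_strings, key=len, reverse=True)


ordered = order_by_length(["ab", "abcd", "a", "abc"])
print(ordered)
```

Ordering by length lets a later comparison step skip every target string that is shorter than the preset character string being searched for.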

In this embodiment, the method further includes obtaining the target string set by the following method:

performing character conversion on the target text to generate a candidate character string set;

judging whether the size parameter of the target text is smaller than a preset parameter threshold value or not;

when the size parameter of the target text is smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a first target character string set;

and when the size parameter of the target text is not smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a second target character string set.
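The threshold decision in the steps above might be sketched as follows. This is only an illustration under stated assumptions: the threshold value, the name `build_target_set`, and the two segmenter callables are hypothetical, since the source does not give a concrete threshold or interface.

```python
from typing import Callable, List, Tuple

# Hypothetical threshold; the source does not state a concrete value.
SIZE_THRESHOLD_BYTES = 64 * 1024

Segmenter = Callable[[List[str]], List[str]]


def build_target_set(
    candidates: List[str],
    size_bytes: int,
    first_segmenter: Segmenter,
    second_segmenter: Segmenter,
    threshold: int = SIZE_THRESHOLD_BYTES,
) -> Tuple[str, List[str]]:
    """Pick the segmentation path from the size parameter of the text."""
    if size_bytes < threshold:
        # Smaller than the threshold: build the first target string set.
        return "first", first_segmenter(candidates)
    # Not smaller than the threshold: build the second target string set.
    return "second", second_segmenter(candidates)


# Placeholder segmenters, used only to show which branch is taken.
kind, strings = build_target_set(["a or b"], 10, lambda c: c, lambda c: [])
print(kind, strings)
```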

Specifically, the candidate character string set includes a plurality of candidate character strings. A candidate character string is a Chinese character string that needs to be segmented, and it is a continuous character string that is not divided by punctuation. For example, the character string "Party A does not handle the initial registration of house ownership on time or does not assist Party B in handling the transfer registration of house ownership, which causes a loss to Party B, and Party A should take responsibility" is divided by punctuation in the middle, is therefore discontinuous, and cannot itself be used as a candidate character string. According to the positions of the punctuation marks, the character string is divided into three sub-character strings, "Party A does not handle the initial registration of house ownership on time or does not assist Party B in handling the transfer registration of house ownership", "causes a loss to Party B" and "Party A should take responsibility"; each sub-character string is taken as a candidate character string, and the candidate character strings are then arranged in their original order to form the candidate character string set.
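A minimal sketch of the punctuation-based split described above is given below. The punctuation set and the name `to_candidate_strings` are assumptions made for illustration; the source does not list which punctuation marks are used.

```python
import re
from typing import List

# Assumed punctuation set (common Chinese and ASCII marks); not from the source.
PUNCTUATION = re.compile("[，。；：！？、,.;:!?]+")


def to_candidate_strings(text: str) -> List[str]:
    """Split text at punctuation and keep the non-empty pieces in order."""
    pieces = PUNCTUATION.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]


print(to_candidate_strings("A does X, which hurts B, and A is liable."))
```

Each returned piece contains no punctuation from the assumed set, so it qualifies as a continuous candidate character string in the sense used above.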

Specifically, the method further includes determining the first target character string by a method including:

matching any candidate character string in the candidate character string set with a first disabled word bank;

determining a first stop word according to the matching degree of the candidate character string;

and filtering the first stop word from the candidate character string to generate the first target character string.

Further, a plurality of the first target character strings form a first target character string set U, where U = (U₁, U₂, …, U_m), m ≥ 1.

Further, the first stop word comprises predefined words and punctuation marks, wherein the predefined words at least comprise conjunctions, auxiliary words, interjections and the like, such as "and", "or" and "of".

For better understanding of how the first stop word is filtered out of the candidate character string to generate the first target character string: for example, the candidate character string is "Party A does not handle the initial registration of house ownership on time or does not assist Party B in handling the transfer registration of house ownership" and the first stop word is "or"; after the first stop word "or" is filtered out, two first target character strings are generated, namely "Party A does not handle the initial registration of house ownership on time" and "does not assist Party B in handling the transfer registration of house ownership".
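The filtering in this example can be sketched in Python as shown below. The sketch assumes exact substring matching against the stop word library, which is a simplification of the "matching degree" mentioned in the method, and the name `filter_stop_words` is hypothetical.

```python
import re
from typing import Iterable, List


def filter_stop_words(candidate: str, stop_words: Iterable[str]) -> List[str]:
    """Remove stop words from a candidate string and return the remaining pieces.

    Assumption: a stop word matches only as an exact substring. This suits
    Chinese text, which has no spaces between words.
    """
    words = sorted((w for w in stop_words if w), key=len, reverse=True)
    if not words:
        return [candidate]
    # Longer stop words come first so they win over their own prefixes.
    pattern = "|".join(re.escape(w) for w in words)
    pieces = re.split(pattern, candidate)
    return [piece.strip() for piece in pieces if piece.strip()]


print(filter_stop_words(
    "Party A does not register on time or does not assist Party B", ["or"]
))
```

For English text a plain substring match would also split words that merely contain the stop word, so a real implementation for English would need word-boundary handling.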

Specifically, the method further includes determining the second target character string as follows:

matching any first target character string in the first target character string set with a second disabled word bank;

determining a second stop word according to the matching degree of the first target character string;

and filtering the second stop word from the first target character string to generate the second target character string.

Further, a plurality of the second target character strings form the second target character string set V, where V = (V₁, V₂, …, V_n), n ≥ m.

Further, the second stop word includes predetermined words having actual meaning, for example insulting words such as "dying" and "egg rolling".

Preferably, when the second target character string is determined, the first target character string is determined using the above method, but the method of determining the first target character string is not limited in the present embodiment.
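The derivation of second target character strings from first target character strings might look like the following sketch. The name `derive_second_targets` and the placeholder stop word "BAD" are hypothetical, and exact substring matching is again assumed in place of the unspecified matching degree.

```python
from typing import Iterable, List


def derive_second_targets(
    first_targets: Iterable[str], second_stop_words: Iterable[str]
) -> List[str]:
    """Filter second stop words out of each first target string."""
    stop_words = [word for word in second_stop_words if word]
    second_targets: List[str] = []
    for target in first_targets:
        pieces = [target]
        for word in stop_words:
            # Split every current piece at each occurrence of the stop word.
            pieces = [part for piece in pieces for part in piece.split(word)]
        second_targets.extend(p.strip() for p in pieces if p.strip())
    return second_targets


# "BAD" is a neutral placeholder for an entry in the second stop word library.
print(derive_second_targets(["go BAD now", "hello"], ["BAD"]))
```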

Specifically, in both the first target character string set and the second target character string set, the target character strings may have different lengths. For example, some character strings are words formed by a single character and thus have a length of one character, while others are words formed by a plurality of characters and thus have a length of several characters. The lengths differ because the target character strings are divided according to semantics, which facilitates comparison of the target character strings with the sensitive word library.

In the embodiment, the texts can be divided into two types through the size parameters of the texts, when the size parameters of the texts are small, the texts are beneficial to propagation, strict word segmentation processing needs to be carried out on the texts, and sensitive words are prevented from being omitted; when the size parameter of the text is large, the text is not beneficial to propagation, loose word segmentation processing can be carried out on the text, the query speed of sensitive words is improved, and the influence of the sensitive words on the application of the file is avoided.

S105, sending prompt information when the target character string is determined to have the preset character string;

specifically, the preset character string refers to a character string corresponding to a word in a sensitive word library. The sensitive word library is a weak sensitive word library, that is, a preset character string set W = (W₁, W₂, …, W_k). The weak sensitive word library can be set according to the experience of security experts, business requirements and the like; the weak sensitive words in it are the sensitive words remaining after the strong sensitive words are removed from the whole sensitive word library, such as "yellow" and "miss".

Further, W_j denotes the j-th preset character string, where j = 1, 2, …, k; the preset character string W_j comprises a plurality of characters, W_j = (W_j1, W_j2, …, W_jy).

Specifically, the prompt message includes at least one of: a voice prompt, a light prompt, or a shutdown prompt, etc., which is not limited in this embodiment.

In this embodiment, the method further includes the following steps:

comparing the length of the first target character string in the first target character string set with the length of the preset character string;

when the length of the first target character string is smaller than that of the preset character string, determining that the preset character string does not exist in the first target character string;

and when the length of the first target character string is not less than that of the preset character string, matching the first target character string with the preset character string according to a preset matching rule.

Specifically, U_i denotes the i-th first target character string, where i = 1, 2, …, m; the first target character string U_i comprises a plurality of characters, U_i = (U_i1, U_i2, …, U_ix).

For better understanding of how it is determined whether the first target character string has a preset character string: for example, when x < y, the preset character string is longer than the first target character string, so it can be determined that W_j cannot exist in U_i, that is, it may further be determined that the preset character string set W does not exist in the first target character string set U; otherwise, the first target character string is matched with the preset character string according to a preset matching rule.

Specifically, when the length of the first target character string is not less than the length of the preset character string, matching the first target character string with the preset character string according to a preset matching rule further includes:

comparing the length of the first target character string with the length of the preset character string again;

when the length of the first target character string is equal to that of the preset character string, matching the first target character string with the preset character string according to a first preset matching rule;

and when the length of the first target character string is greater than that of the preset character string, matching the first target character string with the preset character string according to a second preset matching rule.
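The length comparison that selects between the outcomes above can be summarized in a short sketch. The function name `choose_rule` and the returned labels are hypothetical and serve only to make the three branches explicit.

```python
def choose_rule(target: str, preset: str) -> str:
    """Select the handling of a (first target string, preset string) pair by length."""
    if len(target) < len(preset):
        # A shorter target string cannot contain the preset string.
        return "absent"
    if len(target) == len(preset):
        # Equal length: first preset matching rule (character by character).
        return "first_rule"
    # Longer target: second preset matching rule (split by step length).
    return "second_rule"


print(choose_rule("ab", "abc"), choose_rule("abc", "abc"), choose_rule("abcde", "cd"))
```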

Further, the matching the first target character string with the preset character string according to a first preset matching rule includes:

matching each character in the first target character string with each character in the preset character string;

and when the matching degree of each character in the first target character string meets a preset matching degree, determining that the preset character string exists in the first target character string.

For better understanding of matching the first target character string with the preset character string according to the first preset matching rule: for example, when the length of the first target character string is equal to the length of the preset character string, that is, x = y, it is determined whether the matching degree of each character in the first target character string satisfies the preset matching degree, where the preset matching degree may be set to a matching degree representing that the two items being matched are completely consistent. In other words, when each character in the first target character string completely matches the corresponding character in the preset character string, that is, U_iz = W_jz for z = 1, …, y, it is determined that W_j exists in U_i, that is, it may further be determined that the preset character string set W exists in the first target character string set U; otherwise, when the matching degree of any character in the first target character string does not reach the preset matching degree, it is determined that W_j does not exist in U_i, that is, it may further be determined that the preset character string set W does not exist in the first target character string set U.

In practical application, when the first target character string is "I am a miss" and the preset character string is "I am a miss", each character of the first target character string completely matches the corresponding character of the preset character string, and it may be determined that the preset character string "I am a miss" exists in the first target character string; when a character in the first target character string, for example the character "miss", and the corresponding character "little" in the preset character string do not satisfy the preset matching degree, it may be determined that the preset character string does not exist in the first target character string.
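The first preset matching rule can be sketched as a position-by-position comparison. The sketch assumes that the preset matching degree means complete consistency, as the description suggests it may be set, and the name `matches_first_rule` is hypothetical.

```python
def matches_first_rule(target: str, preset: str) -> bool:
    """Return True when U_iz == W_jz for every position z (equal lengths only)."""
    if len(target) != len(preset):
        return False
    # Compare the characters pairwise, position by position.
    return all(t_char == p_char for t_char, p_char in zip(target, preset))


print(matches_first_rule("abcd", "abcd"))  # every position agrees
print(matches_first_rule("abcd", "abxd"))  # the third position differs
```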

Further, the matching the first target character string with the preset character string according to a second preset matching rule includes:

splitting the characters in the first target character string according to a preset step length;

and determining that the preset character string exists in the first target character string according to the matching of the split character and the preset character string.

Further, the preset step length is the length of the preset character string, that is, the characters in the first target character string are split according to the length of the preset character string. This can be understood as: taking each character in the first target character string as a starting character, the first target character string is split, in order of arrangement, into a plurality of character groups to be matched, each having the length of the preset character string. For example, when the first target character string is U_i = abcde and W_j = cd, U_i may be split into ab, bc, cd and de.

Further, each split character group is a character group to be matched, and determining that the preset character string exists in the first target character string according to the matching between the split characters and the preset character string can be understood as: the character groups to be matched are matched with the preset character string in sequence; when the matching degree of any character group to be matched satisfies the preset matching degree, matching stops and it is determined that the first target character string has the preset character string, where the preset matching degree may be set to a matching degree representing that the two items being matched are completely consistent. For example, ab in U_i is matched with cd in W_j, and the matching degree does not satisfy the preset matching degree; bc in U_i is then matched with cd in W_j, and the matching degree does not satisfy the preset matching degree; cd in U_i is then matched with cd in W_j, the matching degree satisfies the preset matching degree, and matching stops. It is then determined that W_j exists in U_i, that is, it may further be determined that the preset character string set W exists in the first target character string set U.

In practical application, when the first target character string contains "miss" and is longer than the preset character string "miss", the first target character string is split into character groups having the length of the preset character string, and the character groups are matched with the preset character string "miss" in sequence; when a character group satisfies the preset matching degree, matching stops, and it may be determined that the preset character string "miss" exists in the first target character string.
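The second preset matching rule can be illustrated with the abcde / cd example from the description. In this sketch the names `split_by_step` and `matches_second_rule` are hypothetical, and the preset matching degree is assumed to mean complete consistency.

```python
from typing import List


def split_by_step(target: str, step: int) -> List[str]:
    """Split a target string into overlapping groups of `step` characters."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Every position that leaves room for a full group is a starting character.
    return [target[i:i + step] for i in range(len(target) - step + 1)]


def matches_second_rule(target: str, preset: str) -> bool:
    """Match each group against the preset string and stop at the first hit."""
    for group in split_by_step(target, len(preset)):
        if group == preset:
            return True
    return False


print(split_by_step("abcde", 2))
print(matches_second_rule("abcde", "cd"))
```

Stopping at the first matching group avoids comparing the remaining groups, which is the early termination described above.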

Preferably, the first target character string is any character string in the first target character string set, and the preset character string is any character string in the preset character string set; this is not limited in this embodiment.

In the above embodiment, the first target character string and the preset character string can be used for quickly querying the target character string with the sensitive word in the target character string set, so that the target character string with the sensitive word is prevented from being omitted, and the query speed of the weakly sensitive word is improved.

In some embodiments, the method further comprises the following step of determining that the first target character string has the preset character string:

when the length of the i-th first target character string is smaller than the length of the j-th preset character string, and the length of the (i-1)-th first target character string is not smaller than the length of the j-th preset character string, comparing the length of any target character string in the first target character string subset (U₁, U₂, …, U_i-1) with the length of the preset character string W_j+1, so as to determine that the first target character string has a preset character string.

In the embodiment, the comparison between part of the first target character string and the preset character string can be omitted, and the query speed of the weakly sensitive words is improved.
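One way to read this pruning step is sketched below: because the first target character strings are arranged from longest to shortest, only the leading strings that are at least as long as the preset character string need to be compared. The sketch is an interpretation, not the source's stated procedure; the name `eligible_prefix` is hypothetical and a binary search is used as a convenience.

```python
from bisect import bisect_right
from typing import List


def eligible_prefix(sorted_targets: List[str], preset: str) -> List[str]:
    """Return the leading target strings whose length is >= len(preset).

    Assumption: `sorted_targets` is already arranged from longest to shortest.
    """
    # Negated lengths are in ascending order, which bisect requires.
    negated_lengths = [-len(target) for target in sorted_targets]
    cut = bisect_right(negated_lengths, -len(preset))
    return sorted_targets[:cut]


targets = ["abcde", "abc", "xyz", "a"]
print(eligible_prefix(targets, "cd"))
```

Every string after the cut is shorter than the preset character string and can be skipped without a character comparison.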

In this embodiment, the method further includes the following steps:

comparing the length of any second target character string in the second target character string set with the length of the preset character string;

when the length of the second target character string is equal to that of the preset character string, matching each character in the second target character string with each character in the preset character string;

and when the matching degree of each character in the second target character string meets the preset matching degree, determining that the second target character string has a preset character string.

Further, V_i denotes the i-th second target character string, where i = 1, 2, …, n; the second target character string V_i comprises a plurality of characters, V_i = (V_i1, V_i2, …, V_ix).

For better understanding of how it is determined that the second target character string has a preset character string: for example, when x = y and each character of V_i matches the corresponding character of W_j, it is determined that W_j exists in V_i, and it may further be determined that the preset character string set W exists in the second target character string set V; otherwise, when x ≠ y, the preset character string set W does not exist in the second target character string set V.

In practical application, the second target character string set comprises two second target character strings, "I" and "miss", and the preset character string is "miss". The length of the second target character string "I" is not equal to the length of the preset character string "miss", so the preset character string "miss" does not exist in the second target character string "I". The length of the second target character string "miss" is equal to that of the preset character string "miss", so the two characters of the second target character string are matched with the two characters of the preset character string; when they match completely, it is determined that the preset character string exists in the second target character string.
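The handling of the second target character string set can be sketched as follows, using the "I" / "miss" example. The name `second_set_hits` is hypothetical, and complete consistency is assumed as the preset matching degree.

```python
from typing import Iterable, List, Tuple


def second_set_hits(
    second_targets: Iterable[str], presets: Iterable[str]
) -> List[Tuple[str, str]]:
    """Pair each second target string with the first preset string it equals."""
    preset_list = list(presets)
    hits: List[Tuple[str, str]] = []
    for target in second_targets:
        for preset in preset_list:
            if len(target) != len(preset):
                # Unequal lengths: no character comparison is performed.
                continue
            if all(a == b for a, b in zip(target, preset)):
                hits.append((target, preset))
                break
    return hits


print(second_set_hits(["I", "miss"], ["miss"]))
```

Only equal-length pairs are compared character by character, which reflects the looser and faster handling described for large texts.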

In the above embodiment, since the target text is large and is converted into the second target character string set, the number of the target character strings can be reduced, comparison with the weakly sensitive words is facilitated, the query rate of the weakly sensitive words can be improved, and the target character strings with the sensitive words in the target character string set can be rapidly queried through the second target character string and the preset character strings, so that omission of the target character strings with the sensitive words is avoided.

The processing method provided by this embodiment can perform different word segmentation processing methods on the character strings in the electronic file according to the file size in the electronic file, and determine the weakly sensitive words by adopting a corresponding method on the target character string set formed after word segmentation, thereby avoiding the influence on the application of the electronic file due to omission of the weakly sensitive words and improving the query rate of the weakly sensitive words.

An embodiment of the present invention further provides a computer device, including a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for processing text content as described above.

The computer device of embodiments of the present invention exists in a variety of forms, including but not limited to:

(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.

(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) A portable entertainment device: such devices can display and play multimedia content. This type of device includes audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) A server: a device that provides computing services. A server comprises a processor, a hard disk, a memory, a system bus and the like; its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, expandability, manageability and the like.

(5) Other electronic devices with data interaction functions.

The embodiment of the present invention further provides a storage medium, which can be disposed in an electronic device to store at least one instruction or at least one program for implementing a method for processing text content in the method embodiment, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for processing text content provided in the method embodiment.

Optionally, in this embodiment, the storage medium may be located in at least one of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.

It should be noted that the order of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner. For parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device and electronic apparatus embodiments are substantially similar to the method embodiments, they are described briefly, and reference may be made to the corresponding descriptions of the method embodiments for the relevant points.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description only illustrates preferred embodiments of the present invention and is not to be construed as limiting the invention. Any modifications, equivalents, improvements, and the like that fall within the spirit and principle of the present invention are intended to be included within its scope of protection.

Claims

1. A method for processing text content, the method comprising:

acquiring a size parameter of a target text;

2. The processing method according to claim 1, wherein the method further comprises obtaining the target character string set by:

3. The processing method according to claim 2, wherein the target character string set is a character string set generated by arranging a plurality of target character strings in a preset order; wherein the target character string comprises a first target character string and a second target character string.

4. The processing method according to claim 3, further comprising determining the first target character string by a method comprising:

5. The processing method according to claim 4, wherein the method further comprises determining that the first target character string has the preset character by:

6. The processing method according to claim 3, further comprising determining the second target character string as follows:

7. The processing method according to claim 6, wherein the method further comprises determining that the second target character string has a preset character string by:

8. The processing method according to claim 1, wherein the preset character string is a character string corresponding to a word in a sensitive thesaurus.

9. A computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the processing method according to any one of claims 1 to 8.

10. A computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a processing method according to any one of claims 1 to 8.