CN113434641A - Multithreading mask word query replacement method

Info

Publication number: CN113434641A
Application number: CN202110844719.3A
Authority: CN
Inventors: 邝剑洪; 罗培羽; 张永明; 李勇; 刘效法
Original assignee: Guangzhou 4399 Information Technology Co ltd
Current assignee: Guangzhou 4399 Information Technology Co ltd
Priority date: 2021-07-26
Filing date: 2021-07-26
Publication date: 2021-09-24
Anticipated expiration: 2041-07-26
Also published as: CN113434641B

Abstract

The invention provides a multithreaded mask word query and replacement method comprising the following steps: acquiring the number of CPU cores, denoted N, and starting N threads accordingly; distributing the m mask words among the N threads according to an allocation algorithm; executing the N threads in parallel; if the mode is the mask word query mode, executing the mask word query process; and if the mode is the mask word query-and-replace mode, executing the mask word query-and-replace process. The method has the following advantages: mask words are queried by multiple threads in parallel, the query results obtained are merged, and the mask words are then replaced in a centralized manner, which improves the accuracy and comprehensiveness of mask word query and replacement and also greatly improves its efficiency.

Description

Multithreading mask word query replacement method

Technical Field

The invention belongs to the technical field of data query, and particularly relates to a multithreading mask word query replacement method.

Background

Mask words, also known as sensitive words, are words with politically sensitive, violent, or unhealthy overtones, or uncivil language. At present, content websites must screen content for mask words before it is released, and release is allowed only when the screening finds no mask word.

The traditional mask word auditing mode queries the content to be released with the mask words one by one, in sequence, and replaces each mask word as soon as it is found. For example, suppose two mask words must be queried: mask word 1, the character string "AAA", and mask word 2, the character string "AAABBB". The original character string to be audited is "XXXAAABBBXXXAAACCC". The original character string is first queried with mask word 1, and each occurrence found is replaced with "***", giving the first character string "XXX***BBBXXX***CCC". The first character string is then queried with mask word 2, which can no longer be found in it. The processed character string is therefore the first character string, "XXX***BBBXXX***CCC".

This approach has the following problems: (1) The number of mask words is generally large and keeps growing over time, so querying and replacing them one by one is inefficient. (2) If several mask words overlap, the result of querying and replacing an earlier mask word affects the query result for a later one, so the query and replacement are incomplete. In the example above, replacing mask word 1 destroys the original character string AAABBB, so the subsequent query with mask word 2 cannot find AAABBB, and the still-sensitive character string BBB is left unprocessed.
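Problem (2) can be reproduced with a minimal sketch of the traditional serial approach. The function name `replace_sequentially`, the content string "XXXAAABBBXXXAAACCC" (the reading consistent with the positions given later in the description), and the filler character "*" are assumptions made for illustration.

```python
def replace_sequentially(content, mask_words, filler="*"):
    """Traditional approach: query and replace one mask word at a time."""
    for word in mask_words:
        # Each replacement rewrites the content seen by later mask words.
        content = content.replace(word, filler * len(word))
    return content


result = replace_sequentially("XXXAAABBBXXXAAACCC", ["AAA", "AAABBB"])
print(result)  # XXX***BBBXXX***CCC
```

Because "AAA" is replaced first, "AAABBB" no longer occurs in the intermediate string, and "BBB" survives in the output.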

How to solve these problems, improving both the efficiency of mask word query and replacement and the accuracy and comprehensiveness of its results, is a problem that urgently needs to be solved.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a multithreading mask word query and replacement method, which can effectively solve the problems.

The technical scheme adopted by the invention is as follows:

the invention provides a multithreading mask word query and replacement method, which comprises the following steps:

step 1, determining the Content that needs mask word processing; for the Content, determining the range of mask words, assumed to comprise m mask words denoted Word(1), Word(2), ..., Word(m);

step 2, acquiring the number of CPU cores, denoted N, and then starting a corresponding number of N threads;

step 3, distributing the m mask words among the N threads according to an allocation algorithm;

step 4, executing the N threads in parallel; if the mode is the mask word query mode, executing step 5; if the mode is the mask word query-and-replace mode, executing step 6;

step 5, a mask word query mode:

step 5.1, executing N threads in parallel;

for any thread, denoted as thread P_i, the thread queries the Content using each mask word allocated to it as a query keyword, and as soon as a corresponding mask word is found in the Content, immediately returns to the control module a notification message that the Content contains a mask word; otherwise, if no corresponding mask word has been found when the query of the Content is completed, the thread returns to the control module a notification message that the Content does not contain a mask word;

step 5.2, when the control module receives from any thread a notification message that the Content contains a mask word, the control module concludes that the Content contains a mask word, immediately sends the N threads an instruction to stop querying the Content, and ends the flow;

if the control module receives from all threads a notification message that the Content does not contain a mask word, the control module concludes that the Content does not contain a mask word, and ends the flow;

step 6, mask word query-and-replace mode:

step 6.1, executing N threads in parallel;

for any thread, denoted as thread P_i, the thread queries the Content using each mask word allocated to it as a query keyword; if no corresponding mask word is found in the Content, the thread returns to the control module a notification message that the Content does not contain a mask word; if a corresponding mask word is found in the Content, the position of the mask word in the Content and the length of the mask word form a two-dimensional array {position, length}, and after the query is finished, all two-dimensional arrays {position, length} found are sent to the control module;

step 6.2, the control module collects the two-dimensional arrays {position, length} returned by the N threads, and analyzes and processes them to obtain the processed two-dimensional arrays {position, length};

step 6.3, for each processed two-dimensional array {position, length}, locating the corresponding character string in the Content according to the position and the length, and replacing it with non-sensitive characters by means of a replacement algorithm; and ending the flow.
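The final replacement of step 6.3 can be sketched as follows. The text does not specify the replacement algorithm, so the function name `replace_spans` and the choice of "*" as the non-sensitive character are assumptions; each array {position, length} is represented as a Python tuple.

```python
def replace_spans(content, spans, filler="*"):
    """Overwrite every (position, length) span of content with filler."""
    chars = list(content)
    for pos, length in spans:
        # Clamp to the end of the content so a bad span cannot raise.
        for i in range(pos, min(pos + length, len(chars))):
            chars[i] = filler
    return "".join(chars)


print(replace_spans("XXXAAABBBXXXAAACCC", [(3, 6), (12, 6)]))
# XXX******XXX******
```

Because replacement runs only once, on the original Content and after all queries have finished, no query can be affected by an earlier replacement.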

Preferably, in step 3, the allocation algorithm is an average allocation algorithm, that is: the m masks are equally allocated to the N threads.
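A minimal sketch of such an average allocation, assuming a simple round-robin deal (the text does not say how the equal split is made); `allocate_evenly` is a hypothetical name.

```python
def allocate_evenly(mask_words, n_threads):
    """Deal m mask words out to n_threads buckets in round-robin order."""
    buckets = [[] for _ in range(n_threads)]
    for i, word in enumerate(mask_words):
        buckets[i % n_threads].append(word)
    return buckets


print(allocate_evenly(["a", "b", "c", "d", "e"], 2))
# [['a', 'c', 'e'], ['b', 'd']]
```

With this scheme the bucket sizes differ by at most one word.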

Preferably, the allocation algorithm allocates the mask words according to the following principle:

sorting all the mask words according to their first character;

when a thread needs to be allocated a plurality of mask words, allocating mask words with the same or similar first characters to the same thread, and recording the number of mask words allocated to each thread.
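One possible reading of this principle is sketched below: sort by first character, then cut the sorted list into contiguous chunks of nearly equal size. The function name `allocate_by_first_char` and the chunking rule are assumptions; in this simplified version a chunk boundary can still separate words that share a first character.

```python
def allocate_by_first_char(mask_words, n_threads):
    """Sort by first character, then hand out contiguous chunks."""
    ordered = sorted(mask_words, key=lambda w: w[:1])
    base, extra = divmod(len(ordered), n_threads)
    buckets, start = [], 0
    for i in range(n_threads):
        size = base + (1 if i < extra else 0)  # first `extra` threads get one more
        buckets.append(ordered[start:start + size])
        start += size
    counts = [len(b) for b in buckets]  # number of mask words per thread
    return buckets, counts


buckets, counts = allocate_by_first_char(["CCC", "AAA", "BBB", "AAABBB"], 2)
print(buckets)  # [['AAA', 'AAABBB'], ['BBB', 'CCC']]
print(counts)   # [2, 2]
```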

Preferably, in step 6.1, the position of the mask word in the Content and the length of the mask word refer to:

the position of the mask word in the Content: when the mask word is located in the Content, the position of its first character in the Content is taken as the position of the mask word in the Content;

the length of the mask word: the number of characters contained in the mask word character string.
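As an illustration of these definitions, the sketch below collects a (position, length) pair for every occurrence of one mask word, counting positions from 0. The name `find_spans` is hypothetical, and the built-in `str.find` is used here in place of whatever search algorithm a thread actually runs.

```python
def find_spans(content, word):
    """Return (position, length) for each occurrence of word in content."""
    if not word:
        return []
    spans = []
    start = content.find(word)
    while start != -1:
        spans.append((start, len(word)))
        # Resume one character later so overlapping occurrences are reported.
        start = content.find(word, start + 1)
    return spans


content = "XXXAAABBBXXXAAACCC"
print(find_spans(content, "AAA"))     # [(3, 3), (12, 3)]
print(find_spans(content, "AAABBB"))  # [(3, 6)]
print(find_spans(content, "CCC"))     # [(15, 3)]
```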

Preferably, step 6.2 is specifically:

comparing the two-dimensional arrays {position, length} with one another: if two-dimensional array R1{position C1, length L1} completely contains two-dimensional array R2{position C2, length L2}, that is, if C1 ≤ C2 and C1 + L1 ≥ C2 + L2, directly deleting two-dimensional array R2;

if two-dimensional array R1{position C1, length L1} and two-dimensional array R2{position C2, length L2} partially overlap or are exactly adjacent, that is, if the following conditions are all satisfied:

condition 1: C2 ≥ C1;

condition 2: C2 ≤ C1 + L1 - 1, or C2 - (C1 + L1 - 1) = 1;

condition 3: C2 + L2 - 1 ≥ C1 + L1 - 1;

then merging two-dimensional array R1 and two-dimensional array R2 to obtain a new two-dimensional array R3{position C1, length = C2 + L2 - C1}.
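The containment and merge rules above can be sketched as a single pass over the arrays sorted by position. Sorting first, and placing the longer array first when positions tie, is an assumption made so that R1 always precedes R2; the text itself only states the pairwise rules. `merge_spans` is a hypothetical name.

```python
def merge_spans(spans):
    """Drop contained spans and merge overlapping or adjacent ones."""
    merged = []
    for c2, l2 in sorted(spans, key=lambda s: (s[0], -s[1])):
        if merged:
            c1, l1 = merged[-1]
            end1 = c1 + l1 - 1  # last index covered by R1
            end2 = c2 + l2 - 1  # last index covered by R2
            if end2 <= end1:
                continue        # R1 completely contains R2: delete R2
            if c2 <= end1 + 1:
                # Partial overlap (c2 <= end1) or adjacency (c2 - end1 == 1):
                # R3 = {C1, C2 + L2 - C1}
                merged[-1] = (c1, c2 + l2 - c1)
                continue
        merged.append((c2, l2))
    return merged


print(merge_spans([(3, 3), (3, 6), (12, 3), (15, 3)]))
# [(3, 6), (12, 6)]
```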

The multithreading mask word query replacement method provided by the invention has the following advantages:

the method queries mask words with multiple threads in parallel, merges the query results obtained, and then replaces the mask words in a centralized manner, which improves the accuracy and comprehensiveness of mask word query and replacement and also greatly improves its efficiency.

Drawings

Fig. 1 is a schematic flowchart of a multithread mask word query replacement method according to the present invention.

Detailed Description

In order to make the technical problems solved by the present invention, its technical solutions, and its advantageous effects clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.

The main conception of the invention is as follows:

the invention provides a multithreaded mask word query and replacement method that overcomes the drawback of the traditional approach, in which mask words are searched and replaced serially one after another, and greatly improves query efficiency by using the multiple cores of the CPU. The principle is as follows: a matching number of threads is started according to the CPU's core count and multithreading parameters, the mask words to be searched are distributed evenly among the threads, and each thread is responsible for querying the mask words allocated to it; the more CPU cores there are, the higher the efficiency. If the only question is whether the content contains a mask word, the check can be judged a failure as soon as any one thread returns a failure result, where failure means that a mask word was found; this greatly shortens the response time. If mask words need to be replaced, each thread returns a list of mask word positions and lengths, and the lists are merged before replacement is performed. The threads therefore cannot affect one another, and merging further reduces the number of replacement operations, which improves efficiency.
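The two modes described in this principle can be sketched end to end as follows. All names (`split_evenly`, `query_mode`, `replace_mode`) are hypothetical, the calling function plays the role of the control module, and "*" is an assumed replacement character. Note also that threads in CPython do not run pure-Python string searches on several cores at once, so this sketch shows the control flow rather than the speedup; a real implementation would presumably use processes or a language with native threads.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def split_evenly(mask_words, n):
    """Deal the mask words out to n threads as evenly as possible."""
    return [mask_words[i::n] for i in range(n)]


def query_mode(content, mask_words):
    """Return True as soon as any thread finds one of its mask words."""
    n = os.cpu_count() or 1
    stop = threading.Event()  # stands in for the "stop querying" instruction

    def worker(words):
        for word in words:
            if stop.is_set():  # another thread already reported a hit
                return False
            if word and word in content:
                stop.set()
                return True
        return False

    with ThreadPoolExecutor(max_workers=n) as pool:
        return any(pool.map(worker, split_evenly(mask_words, n)))


def replace_mode(content, mask_words, filler="*"):
    """Collect (position, length) pairs in parallel, then replace centrally."""
    n = os.cpu_count() or 1

    def worker(words):
        spans = []
        for word in words:
            start = content.find(word) if word else -1
            while start != -1:
                spans.append((start, len(word)))
                start = content.find(word, start + 1)
        return spans

    with ThreadPoolExecutor(max_workers=n) as pool:
        per_thread = list(pool.map(worker, split_evenly(mask_words, n)))

    # Marking covered indices stands in for the merge step: contained,
    # overlapping, and adjacent spans all collapse into one index set.
    covered = set()
    for spans in per_thread:
        for pos, length in spans:
            covered.update(range(pos, pos + length))
    return "".join(filler if i in covered else ch for i, ch in enumerate(content))


words = ["AAA", "AAABBB", "CCC"]
print(query_mode("XXXAAABBBXXXAAACCC", words))    # True
print(replace_mode("XXXAAABBBXXXAAACCC", words))  # XXX******XXX******
```

Unlike the serial approach, every thread searches the unmodified Content, so "AAABBB" is still found even though "AAA" is also a mask word.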

Referring to Fig. 1, the multithreaded mask word query replacement method provided by the present invention includes the following steps:

for example, the allocation algorithm is an equal allocation algorithm, namely: the m masks are equally allocated to the N threads.

The smaller the difference in search time among the threads, the more fully the advantage of multiple cores is exploited; the more reasonable the allocation of mask words, therefore, the higher the efficiency. The invention recommends allocating mask words in the following way:

Thus, each thread records two metrics: the number of mask words assigned to. These two metrics control the mask words borne by each thread. Each thread may perform its query with any of various algorithms, including but not limited to the KMP (Knuth-Morris-Pratt), BM (Boyer-Moore), and AC (Aho-Corasick) algorithms, and the choice can be optimized together with the mask word allocation scheme.

step 5, a mask word query mode:

step 5.1, executing N threads in parallel;

step 6, mask word query-and-replace mode:

step 6.1, executing N threads in parallel;

wherein the position of the mask word in the Content and the length of the mask word refer to:

For example: if the mask word is the character string "CCC" and the Content is the character string "XXXAAABBBXXXAAACCC", the returned two-dimensional array is {15,3}. Positions in the Content are counted from 0, so the first character "C" of the mask word is at position 15 in the Content, and the character string "CCC" is 3 characters long.

For example: assuming that AAA, AAABBB, and CCC are mask words, the results {3,3}, {3,6}, {12,3}, and {15,3} are returned for the character string "XXXAAABBBXXXAAACCC".

the step 6.2 is specifically as follows:

condition 1: the position C2 is more than or equal to the position C1;

condition 2: C2 ≤ C1 + L1 - 1 (the case in which the two two-dimensional arrays partially overlap), or C2 - (C1 + L1 - 1) = 1 (the case in which the two two-dimensional arrays are adjacent);

For example, consider the four two-dimensional arrays {3,3}, {3,6}, {12,3}, and {15,3}.

If {3,6} is two-dimensional array R1 and {3,3} is two-dimensional array R2, then R1 completely contains R2, and R2 is deleted;

for {12,3} and {15,3}, with {12,3} as two-dimensional array R1 and {15,3} as two-dimensional array R2, R1 and R2 are adjacent, since 15 - (12 + 3 - 1) = 1, and the merged two-dimensional array R3 is {12,6}.

Therefore, after the four two-dimensional arrays {3,3}, {3,6}, {12,3}, and {15,3} are processed, the final result is {3,6} and {12,6}; that is, the work is reduced from 4 replacements to 2.

In summary, the present invention provides a multithreaded mask word query and replacement method that queries mask words with multiple threads in parallel, merges the query results obtained, and then replaces the mask words in a centralized manner, which improves the accuracy and comprehensiveness of mask word query and replacement and also greatly improves its efficiency.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims

1. A multi-threaded mask word query replacement method, comprising the steps of:

step 5, a mask word query mode:

step 5.1, executing N threads in parallel;

for any thread, denoted as thread P_i, the thread queries the Content using each mask word allocated to it as a query keyword, and as soon as a corresponding mask word is found in the Content, immediately returns to the control module a notification message that the Content contains a mask word; otherwise, if no corresponding mask word has been found in the Content when the query of the Content is completed, the thread returns to the control module a notification message that the Content does not contain a mask word;

step 6, mask word query-and-replace mode:

step 6.1, executing N threads in parallel;

2. The method as claimed in claim 1, wherein in step 3, the allocation algorithm is an average allocation algorithm, that is: the m masks are equally allocated to the N threads.

3. The method of claim 1, wherein the allocation algorithm is: the allocation of the mask word is performed according to the following principle:

4. A multi-thread mask word query replacement method as claimed in claim 1, wherein in step 6.1, the position of the mask word in the Content and the length of the mask word refer to:

5. The method of claim 4, wherein the step 6.2 is specifically:

condition 1: the position C2 is more than or equal to the position C1;