CN112052676A - Text content processing method, computer equipment and storage medium - Google Patents

Text content processing method, computer equipment and storage medium

Publication number
CN112052676A
Authority
CN
China
Prior art keywords
character string
target
preset
target character
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010897035.5A
Other languages
Chinese (zh)
Other versions
CN112052676B (en)
Inventor
郭芳
于云成
王炳功
於雪松
于志鹏
姜乃榕
刘子正
秦冲
张巍
王晓燕
沙鑫
车晨
滕建港
张英
张玉苗
张雪玮
滕瑶琪
陈林
邹承志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rongcheng Power Supply Co Of State Grid Shandong Electric Power Co
Original Assignee
Rongcheng Power Supply Co Of State Grid Shandong Electric Power Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rongcheng Power Supply Co Of State Grid Shandong Electric Power Co
Priority to CN202010897035.5A
Publication of CN112052676A
Application granted
Publication of CN112052676B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/226 Validation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text content processing method, computer equipment and a storage medium. The method comprises the following steps: acquiring a size parameter of a target text; performing word segmentation processing on the target text according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings; and sending prompt information when it is determined that a preset character string exists in a target character string. With this method, different word segmentation processing methods can be applied to the character strings in an electronic file according to the size of the file, and weakly sensitive words are determined by a corresponding method in the target character string set formed after word segmentation, so that the application of the electronic file is not affected by omitted weakly sensitive words and the query speed for weakly sensitive words is improved.

Description

Text content processing method, computer equipment and storage medium
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method for processing text content, a computer device, and a storage medium.
Background
At present, sensitive words often appear in electronic documents. They are generally divided into strong sensitive words and weak sensitive words. Strong sensitive words are sensitive words that must be detected and kept out of electronic documents, whereas weak sensitive words are sensitive words that should be detected as far as possible; although they are sensitive, their influence is limited as long as they are not widely spread. In the prior art, strong sensitive words are strictly monitored, but because the word segmentation dictionary is fixed and the context of a sensitive word affects segmentation, weak sensitive words cannot be accurately segmented and monitored, which in turn affects the application of electronic files.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiment of the invention provides a text content processing method, computer equipment and a storage medium. Different word segmentation processing methods can be applied to the character strings in an electronic file according to the size of the file, and weak sensitive words are determined by a corresponding method in the target character string set formed after word segmentation, so that the application of the electronic file is not affected by omitted weak sensitive words and the query speed for weak sensitive words is improved. The technical scheme is as follows:
in one aspect, a method for processing text content, the method comprising the steps of:
acquiring a size parameter of a target text;
performing word segmentation processing on the target text according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings;
and sending prompt information when it is determined that a preset character string exists in the target character string.
In another aspect, a computer device includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the processing method as described above.
In another aspect, a computer-readable storage medium stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the processing method as described above.
The processing method, the computer equipment and the storage medium of the text content have the following technical effects:
based on the technical scheme of the invention, different word segmentation processing methods are adopted for character strings in the electronic file according to the size of the file in the electronic file to obtain different target character string sets, and different methods are adopted for different character string sets to determine weak sensitive words; therefore, according to the technical scheme, an appropriate word segmentation processing method is selected based on the file size of the text, weak sensitive words are determined in the target character string set generated based on the word segmentation mode, the phenomenon that the word segmentation accuracy of the weak sensitive words is influenced by the word segmentation method and the context is avoided, the influence of the sensitive words on the application of the electronic file is further avoided, and the query speed of the sensitive words is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a text content processing method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method for processing text content provided by the embodiment of the invention can be applied to any computer equipment with data processing capability. The computer equipment can be a terminal or a server, and when executing the method for processing text content provided by the embodiment of the invention, the computer equipment can execute it independently or in cooperation with a cluster.
The present embodiment provides a method for processing text content, fig. 1 is a flowchart of a method for processing text content provided by the present embodiment, and the present specification provides the method operation steps as described in the embodiments or the flowchart, but may include more or less operation steps based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In practice, the system or server product may be implemented in a sequential or parallel manner (e.g., parallel processor or multi-threaded environment) according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 1, the method may include the steps of:
S101, acquiring a size parameter of a target text;
specifically, the size parameter of the target text represents the data amount of the target text, wherein the data amount unit may be Byte (B, Byte), Kilobyte (KB), Megabyte (MB), Gigabyte (GB), or the like.
In the present embodiment, the target text refers to a document, such as a manuscript, a paper, a promo, etc., applied to the printing apparatus, and is not limited in the present embodiment.
S103, performing word segmentation processing on the target text according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings;
specifically, the target character string set refers to character strings generated by arranging a plurality of target character strings according to a preset sequence, wherein the target character strings include a first target character string and a second target character string, and only one of the first target character string and the second target character string is a target character string generated for the same target text.
Furthermore, the rule of the preset sequence is that the target character strings are arranged by length from longest to shortest, which facilitates the comparison between the target character strings and the sensitive words in the sensitive word library and improves the query speed for sensitive words.
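The ordering rule above can be sketched in a few lines of Python. This is only an illustrative sketch; the function name order_by_length is an assumption and does not appear in the original text.

```python
def order_by_length(target_strings):
    """Return the target strings arranged from longest to shortest.

    Hypothetical helper: the name is chosen for illustration only.
    sorted() is stable, so strings of equal length keep their original order.
    """
    return sorted(target_strings, key=len, reverse=True)
```

Arranging the strings this way means that, once a string shorter than a given sensitive word is reached, every later string is also too short to contain that word.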
In this embodiment, the method further includes obtaining the target string set by the following method:
performing character conversion on the target text to generate a candidate character string set;
judging whether the size parameter of the target text is smaller than a preset parameter threshold value or not;
when the size parameter of the target text is smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a first target character string set;
and when the size parameter of the target text is not smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a second target character string set.
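The threshold decision described in the steps above may be sketched as follows. This is a minimal sketch under stated assumptions: the size parameter is taken to be the UTF-8 byte length of the text, and the names text_size_in_bytes, select_target_set, build_first_set and build_second_set are hypothetical; the two builder callables stand in for the two word segmentation procedures, which are described separately.

```python
def text_size_in_bytes(text, encoding="utf-8"):
    """Assumed size parameter: number of bytes of the encoded text."""
    return len(text.encode(encoding))


def select_target_set(candidates, size, threshold, build_first_set, build_second_set):
    """Choose the segmentation procedure from the size parameter.

    size < threshold      -> first target character string set
    size >= threshold     -> second target character string set
    The two builders are caller-supplied callables (hypothetical).
    """
    if size < threshold:
        return build_first_set(candidates)
    return build_second_set(candidates)
```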
Specifically, the candidate character string set includes a plurality of candidate character strings. A candidate character string is a Chinese character string that needs to be segmented, and it must be a continuous character string that is not divided by punctuation. For example, the character string "Party A does not handle initial registration of house ownership on time or does not assist Party B in handling transfer registration of house ownership, which causes a loss to Party B, and Party A should take responsibility" contains punctuation in the middle, is therefore discontinuous, and cannot be used as a candidate character string as a whole. According to the positions of the punctuation marks, the character string is divided into three sub-character strings, "Party A does not handle initial registration of house ownership on time or does not assist Party B in handling transfer registration of house ownership", "which causes a loss to Party B" and "Party A should take responsibility"; each sub-character string is taken as a candidate character string, and the candidate character strings are then arranged in their order of appearance to form the candidate character string set.
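The punctuation-based division described above can be illustrated with a short Python sketch. The set of punctuation marks and the function name to_candidate_strings are assumptions made for illustration; the original text does not list the punctuation marks used.

```python
import re

# Assumed punctuation set (Chinese and ASCII marks); not taken from the original text.
PUNCTUATION = r"[，。！？；：、,.!?;:]+"


def to_candidate_strings(text):
    """Split text at punctuation and keep the non-empty pieces in order."""
    pieces = re.split(PUNCTUATION, text)
    return [piece.strip() for piece in pieces if piece.strip()]
```

For instance, to_candidate_strings("Party A fails to register, causing a loss to Party B, Party A is liable.") yields three candidate character strings in their order of appearance.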
Specifically, the method further includes determining the first target character string by a method including:
matching any candidate character string in the candidate character string set with a first disabled word bank;
determining a first stop word according to the matching degree of the candidate character string;
and filtering the first stop word from the candidate character string to generate the first target character string.
Further, a plurality of the first target character strings form a first target character string set U, where $U = (U_1, U_2, \ldots, U_m)$, $m \ge 1$.
Further, the first stop word comprises predefined words and punctuation marks, wherein the predefined words at least comprise conjunctions, auxiliary words, interjections and the like, such as "and", "or", "of", etc.
For better understanding, the first stop word is filtered out of the candidate character string to generate the first target character string as follows: for example, when the candidate character string is "Party A does not handle the initial registration of house ownership on time or does not assist Party B in handling the transfer registration of house ownership" and the first stop word is "or", filtering out "or" generates two first target character strings, "Party A does not handle the initial registration of house ownership on time" and "does not assist Party B in handling the transfer registration of house ownership".
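The filtering example above can be sketched as follows. This is a minimal sketch that assumes exact substring matching against the stop word library, whereas the original text speaks of a matching degree; the function name filter_stop_words is hypothetical. Substring matching suits unsegmented Chinese text; for whitespace-delimited languages a word-boundary check would be needed.

```python
import re


def filter_stop_words(candidate, stop_words):
    """Remove stop words from a candidate string and return the remaining pieces.

    Assumption: a stop word is matched exactly as a substring.
    Longer stop words are tried first so that they are not shadowed by shorter ones.
    """
    if not stop_words:
        return [candidate.strip()] if candidate.strip() else []
    ordered = sorted(stop_words, key=len, reverse=True)
    pattern = "|".join(re.escape(word) for word in ordered)
    return [piece.strip() for piece in re.split(pattern, candidate) if piece.strip()]
```

The same helper could be applied a second time with a second stop word library to derive second target character strings from first target character strings.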
Specifically, the method further includes determining the second target character string as follows:
matching any first target character string in the first target character string set with a second disabled word bank;
determining a second stop word according to the matching degree of the first target character string;
and filtering the second stop word from the first target character string to generate the second target character string.
Further, a plurality of the second target character strings form the second target character string set V, where $V = (V_1, V_2, \ldots, V_n)$, $n \ge m$.
Further, the second stop word includes predetermined words having a literal meaning, for example insulting words such as "dying", "egg rolling", etc.
Preferably, when the second target character string is determined, the first target character string is determined using the above method, but the method of determining the first target character string is not limited in the present embodiment.
Specifically, in both the first target character string set and the second target character string set, the target character strings may have different lengths. For example, some character strings are words formed by a single character and therefore have a length of one character, while others are words formed by several characters and have a corresponding length of several characters. The target character strings differ in length because they are divided according to semantics before being compared with the sensitive word library.
In this embodiment, texts can be divided into two types by their size parameter. When the size parameter of a text is small, the text spreads easily, so strict word segmentation processing needs to be carried out on it to prevent sensitive words from being omitted. When the size parameter of a text is large, the text does not spread easily, so loose word segmentation processing can be carried out on it, which improves the query speed for sensitive words while still preventing sensitive words from affecting the application of the file.
S105, sending prompt information when it is determined that a preset character string exists in the target character string;
specifically, the preset character string refers to a character string corresponding to a word in a sensitive word library, and the sensitive word library is a weakly sensitive word library, that is, a preset character string set $W = (W_1, W_2, \ldots, W_k)$. The weakly sensitive word library can be set according to the experience of security experts, business requirements and the like, and the weakly sensitive words in it are the sensitive words that remain after the strong sensitive words are removed from the whole sensitive word library, such as the words "yellow", "miss", and the like.
Further, $W_j$ denotes the j-th preset character string, where $j = 1, 2, \ldots, k$; the preset character string $W_j$ comprises a plurality of characters, $W_j = (W_{j1}, W_{j2}, \ldots, W_{jy})$.
Specifically, the prompt message includes at least one of: a voice prompt, a light prompt, or a shutdown prompt, etc., which is not limited in this embodiment.
In this embodiment, the method further includes the following steps:
comparing the length of the first target character string in the first target character string set with the length of the preset character string;
when the length of the first target character string is smaller than that of the preset character string, determining that the preset character string does not exist in the first target character string;
and when the length of the first target character string is not less than that of the preset character string, matching the first target character string with the preset character string according to a preset matching rule.
Specifically, $U_i$ denotes the i-th first target character string, where $i = 1, 2, \ldots, m$; the first target character string $U_i$ comprises a plurality of characters, $U_i = (U_{i1}, U_{i2}, \ldots, U_{ix})$.
For better understanding, whether a preset character string exists in the first target character string is determined as follows: for example, when $x < y$, the preset character string is longer than the first target character string, so it can be determined that $W_j$ cannot exist in $U_i$; that is, it may further be determined that the preset character string set W does not exist in the first target character string set U. Otherwise, the first target character string is matched with the preset character string according to a preset matching rule.
Specifically, when the length of the first target character string is not less than the length of the preset character string, matching the first target character string with the preset character string according to a preset matching rule further includes:
comparing the length of the first target character string with the length of the preset character string again;
when the length of the first target character string is equal to that of the preset character string, matching the first target character string with the preset character string according to a first preset matching rule;
and when the length of the first target character string is greater than that of the preset character string, matching the first target character string with the preset character string according to a second preset matching rule.
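The length comparisons above amount to a three-way decision, which may be sketched as follows. The function name choose_rule and the returned labels are hypothetical and serve only to illustrate which branch applies.

```python
def choose_rule(target, preset):
    """Decide how a first target string is compared with a preset string.

    Returns a label (hypothetical naming):
      "absent"      - target is shorter, so the preset string cannot occur in it
      "first_rule"  - equal lengths, compare character by character
      "second_rule" - target is longer, split it by a preset step length
    """
    x, y = len(target), len(preset)
    if x < y:
        return "absent"
    if x == y:
        return "first_rule"
    return "second_rule"
```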
Further, the matching the first target character string with the preset character string according to a first preset matching rule includes:
matching each character in the first target character string with each character in the preset character string;
and when the matching degree of each character in the first target character string meets a preset matching degree, determining that the preset character string exists in the first target character string.
For better understanding, the first target character string is matched with the preset character string according to the first preset matching rule as follows. For example, when the length of the first target character string is equal to the length of the preset character string, that is, when $x = y$, it is determined whether the matching degree of each character in the first target character string satisfies the preset matching degree, where the preset matching degree may be set to a matching degree representing that the two sides being matched are completely consistent. In other words, if each character in the first target character string completely matches the corresponding character in the preset character string, that is, $U_{iz} = W_{jz}$ for $z = 1, \ldots, y$, it is determined that $W_j$ exists in $U_i$, and it may further be determined that the preset character string set W exists in the first target character string set U. Otherwise, when the matching degree of any character in the first target character string does not reach the preset matching degree, it is determined that $W_j$ does not exist in $U_i$, and it may further be determined that the preset character string set W does not exist in the first target character string set U.
In practical application, when the first target character string is "I am a miss" and the preset character string is also "I am a miss", every character of the first target character string matches the corresponding character of the preset character string, and it may be determined that the preset character string "I am a miss" exists in the first target character string. When the two character strings have the same length but the matching degree of one character does not satisfy the preset matching degree (in the example, the character "miss" in the first target character string against the character "little" in the preset character string), it may be determined that the preset character string does not exist in the first target character string.
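The first preset matching rule can be sketched as a position-by-position comparison. This is an illustrative sketch that assumes the preset matching degree means complete equality of each character pair; the function name matches_first_rule is hypothetical.

```python
def matches_first_rule(target, preset):
    """Compare two strings of equal length character by character.

    Assumption: the preset matching degree is exact equality at every position.
    """
    if len(target) != len(preset):
        raise ValueError("the first rule applies only to strings of equal length")
    return all(t == p for t, p in zip(target, preset))
```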
Further, the matching the first target character string with the preset character string according to a second preset matching rule includes:
splitting the characters in the first target character string according to a preset step length;
and determining that the preset character string exists in the first target character string according to the matching of the split character and the preset character string.
Further, the preset step length is the length of the preset character string, that is, the characters in the first target character string are split according to the length of the preset character string. This can be understood as follows: each character in the first target character string is taken in turn as a starting character, and the first target character string is split, in order, into a plurality of characters to be matched whose length equals the length of the preset character string. For example, when the first target character string is $U_i = abcde$ and $W_j = cd$, $U_i$ may be split into ab, bc, cd and de.
Further, the split characters are the characters to be matched, and determining from the matching of the split characters with the preset character string that the preset character string exists in the first target character string can be understood as follows: the characters to be matched are matched with the preset character string in sequence; when the matching degree of any one of the characters to be matched satisfies the preset matching degree, matching stops and it is determined that the preset character string exists in the first target character string, where the preset matching degree can be set to a matching degree representing that the two sides being matched are completely consistent. For example, ab in $U_i$ is matched with cd in $W_j$, and the matching degree does not satisfy the preset matching degree; bc in $U_i$ is then matched with cd in $W_j$, and the matching degree again does not satisfy the preset matching degree; cd in $U_i$ is then matched with cd in $W_j$, the matching degree satisfies the preset matching degree, and matching stops. It is then determined that $W_j$ exists in $U_i$, and it may further be determined that the preset character string set W exists in the first target character string set U.
In practical application, when the first target character string is longer than the preset character string "miss", the first target character string is split into characters to be matched whose length equals that of the preset character string. Each of the characters to be matched is matched in turn with the preset character string "miss"; when the matching degree of one of them satisfies the preset matching degree, matching stops, and it can be determined that the preset character string "miss" exists in the first target character string.
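The second preset matching rule can be sketched as a sliding window over the first target character string. This is a minimal sketch assuming that the preset matching degree means exact equality; the function names split_by_step and matches_second_rule are hypothetical.

```python
def split_by_step(target, step):
    """Split target into overlapping pieces of length `step`, one per start position."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [target[start:start + step] for start in range(len(target) - step + 1)]


def matches_second_rule(target, preset):
    """Match each piece against the preset string and stop at the first match."""
    for piece in split_by_step(target, len(preset)):
        if piece == preset:
            return True  # matching stops as soon as one piece is consistent
    return False
```

With the example from the text, split_by_step("abcde", 2) gives ab, bc, cd and de, and matching against cd stops at the third piece.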
Preferably, the first target character string is any character string in the first target character string set, and the preset character string is any character string in the preset character string set; this embodiment is not limited thereto.
In the above embodiment, the first target character string and the preset character string can be used for quickly querying the target character string with the sensitive word in the target character string set, so that the target character string with the sensitive word is prevented from being omitted, and the query speed of the weakly sensitive word is improved.
In some embodiments, the method further comprises the following step of determining that the first target character string has the preset character string:
when the length of the i-th first target character string is smaller than the length of the j-th preset character string, and the length of the (i-1)-th first target character string is not smaller than the length of the j-th preset character string, comparing the length of any target character string in the subset $(U_1, U_2, \ldots, U_{i-1})$ of the first target character string set with the length of the preset character string $W_{j+1}$, so as to determine whether a preset character string exists in the first target character string.
In the embodiment, the comparison between part of the first target character string and the preset character string can be omitted, and the query speed of the weakly sensitive words is improved.
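One possible reading of the step above is sketched below: because the first target character strings are arranged from longest to shortest, scanning can stop at the first string that is shorter than the preset character string, and only the strings before it need to be compared. The function name eligible_prefix is hypothetical, and the sketch is an interpretation rather than the exact procedure of the embodiment.

```python
def eligible_prefix(sorted_targets, preset):
    """Return the leading targets that are at least as long as the preset string.

    Assumption: sorted_targets is already arranged from longest to shortest,
    so every string after the first too-short one can be skipped.
    """
    needed = len(preset)
    count = 0
    for target in sorted_targets:
        if len(target) < needed:
            break
        count += 1
    return sorted_targets[:count]
```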
In this embodiment, the method further includes the following steps:
comparing the length of any second target character string in the second target character string set with the length of the preset character string;
when the length of the second target character string is equal to that of the preset character string, matching each character in the second target character string with each character in the preset character string;
and when the matching degree of each character in the second target character string meets the preset matching degree, determining that the second target character string has a preset character string.
Further, $V_i$ denotes the i-th second target character string, where $i = 1, 2, \ldots, n$; the second target character string $V_i$ comprises a plurality of characters, $V_i = (V_{i1}, V_{i2}, \ldots, V_{ix})$.
For better understanding, whether a preset character string exists in the second target character string is determined as follows: for example, when $x = y$ and each character of $V_i$ matches the corresponding character of $W_j$, it is determined that $W_j$ exists in $V_i$, and it may further be determined that the preset character string set W exists in the second target character string set V; otherwise, when $x \ne y$, the preset character string set W does not exist in the second target character string set V.
In practical application, suppose the second target character string set comprises two second target character strings, "I" and "miss", and the preset character string is "miss". The length of the second target character string "I" is not equal to the length of the preset character string "miss", so the preset character string "miss" does not exist in the second target character string "I". The length of the second target character string "miss" is equal to that of the preset character string "miss", so each character of the second target character string is matched with the corresponding character of the preset character string; when all characters match completely, it is determined that the preset character string exists in the second target character string.
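The handling of the second target character string set can be sketched as an equal-length, character-by-character check with no splitting. The function name second_set_hits and the returned pair format are assumptions for illustration, and the preset matching degree is again assumed to mean exact equality.

```python
def second_set_hits(second_targets, presets):
    """Return (target, preset) pairs where lengths are equal and all characters match.

    Targets whose length differs from a preset string are skipped without
    any character comparison.
    """
    hits = []
    for target in second_targets:
        for preset in presets:
            if len(target) != len(preset):
                continue
            if all(a == b for a, b in zip(target, preset)):
                hits.append((target, preset))
                break  # one matching preset string is enough for this target
    return hits
```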
In the above embodiment, because the target text is large, it is converted into the second target character string set, which reduces the number of target character strings and makes comparison with the weakly sensitive words easier, thereby improving the query rate of the weakly sensitive words. Comparing the second target character strings with the preset character strings allows the target character strings that contain sensitive words to be found quickly, so that none of them is missed.
The processing method provided by this embodiment can apply different word segmentation methods to the character strings in an electronic file according to the size of the file, and can determine the weakly sensitive words by applying the corresponding method to the target character string set formed after word segmentation. This avoids the application of the electronic file being affected by omitted weakly sensitive words and improves the query rate of the weakly sensitive words.
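As a rough illustration of this size-dependent flow, the Python sketch below chooses a segmentation path from the text size and then applies the matching method that corresponds to the resulting set. Everything concrete in it is an assumption made for illustration: the threshold value and its unit (characters), whitespace splitting as a stand-in for character conversion and word segmentation, the contents of the two stop-word lexicons, substring search as a stand-in for the preset matching rule used on first target character strings, and all function names.

```python
FIRST_STOP_WORDS = {"the", "a", "of"}    # hypothetical first stop-word lexicon
SECOND_STOP_WORDS = {"very", "really"}   # hypothetical second stop-word lexicon
SIZE_THRESHOLD = 1000                    # hypothetical threshold, in characters


def segment(text: str, threshold: int = SIZE_THRESHOLD) -> tuple[str, list[str]]:
    """Split `text` into target strings, choosing the path by text size."""
    candidates = text.split()  # stand-in for character conversion + segmentation
    first = [c for c in candidates if c not in FIRST_STOP_WORDS]
    if len(text) < threshold:
        return "first", first
    # Large text: filter again so that fewer strings remain to be compared.
    second = [c for c in first if c not in SECOND_STOP_WORDS]
    return "second", second


def contains_preset(kind: str, target: str, preset: str) -> bool:
    """Apply the matching method that corresponds to the target string set."""
    if kind == "first":
        # A target shorter than the preset string cannot contain it;
        # substring search stands in for the unspecified matching rule.
        return len(target) >= len(preset) and preset in target
    # Second target strings: equal length, then character-by-character match.
    return len(target) == len(preset) and all(a == b for a, b in zip(target, preset))


def find_sensitive(text: str, presets: set[str],
                   threshold: int = SIZE_THRESHOLD) -> list[str]:
    """Return the target strings for which prompt information would be sent."""
    kind, targets = segment(text, threshold)
    return [t for t in targets if any(contains_preset(kind, t, p) for p in presets)]


sample = "the secrets of a very secret plan"
print(find_sensitive(sample, {"secret"}))                # small-text path
print(find_sensitive(sample, {"secret"}, threshold=10))  # large-text path
```

With the default threshold the sample takes the small-text path and both "secrets" and "secret" are reported; with `threshold=10` it takes the large-text path, where only the equal-length string "secret" is reported.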
An embodiment of the present invention further provides a computer device, including a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for processing text content as described above.
The computer device of embodiments of the present invention exists in a variety of forms, including but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, among others.
(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as iPads.
(3) A portable entertainment device: such devices can display and play multimedia content. Such devices include audio and video players (e.g., iPods), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
(4) A server: a device that provides computing services, comprising a processor, a hard disk, a memory, a system bus, and the like. A server is similar in architecture to a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
An embodiment of the present invention further provides a storage medium, which can be disposed in an electronic device to store at least one instruction or at least one program for implementing the method for processing text content of the method embodiment, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the method for processing text content provided in the method embodiment.
Optionally, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
It should be noted that the order of the above embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner: for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device and electronic apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the corresponding parts of the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A method for processing text content, the method comprising:
acquiring a size parameter of a target text;
performing word segmentation processing on the target file according to the size parameter of the target text to obtain a target character string set, wherein the target character string set comprises a plurality of target character strings;
and sending prompt information when the target character string is determined to have the preset character.
2. The processing method according to claim 1, wherein the method further comprises obtaining the target string set by:
performing character conversion on the target text to generate a candidate character string set;
judging whether the size parameter of the target text is smaller than a preset parameter threshold value or not;
when the size parameter of the target text is smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a first target character string set;
and when the size parameter of the target text is not smaller than the preset parameter threshold, performing word segmentation processing on the candidate character string set to obtain a second target character string set.
3. The processing method according to claim 2, wherein the target character string set is a character string set generated by arranging a plurality of target character strings in a preset order; wherein the target character string comprises a first target character string and a second target character string.
4. The processing method according to claim 3, further comprising determining the first target character string by:
matching any candidate character string in the candidate character string set with a first disabled word bank;
determining a first stop word according to the matching degree of the candidate character string;
and filtering the first stop word from the candidate character string to generate the first target character string.
5. The processing method according to claim 4, wherein the method further comprises determining that the first target character string has the preset character by:
Comparing the length of the first target character string in the first target character string set with the length of the preset character string;
when the length of the first target character string is smaller than that of the preset character string, determining that the preset character string does not exist in the first target character string;
and when the length of the first target character string is not less than that of the preset character string, matching the first target character string with the preset character string according to a preset matching rule.
6. The processing method of claim 3, further comprising determining the second target string as follows:
matching any first target character string in the first target character string set with a second disabled word bank;
determining a second stop word according to the matching degree of the first target character string;
and filtering the second stop word from the first target character string to generate the second target character string.
7. The processing method according to claim 6, wherein the method further comprises determining that the second target character string has a preset character string by:
comparing the length of any second target character string in the second target character string set with the length of the preset character string;
when the length of the second target character string is equal to that of the preset character string, matching each character in the second target character string with each character in the preset character string;
and when the matching degree of each character in the second target character string meets the preset matching degree, determining that the second target character string has a preset character string.
8. The processing method according to claim 1, wherein the predetermined character string is a character string corresponding to a word in a sensitive thesaurus.
9. A computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the processing method according to any one of claims 1 to 8.
10. A computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a processing method according to any one of claims 1 to 8.
CN202010897035.5A 2020-08-31 2020-08-31 Text content processing method, computer equipment and storage medium Expired - Fee Related CN112052676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010897035.5A CN112052676B (en) 2020-08-31 2020-08-31 Text content processing method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112052676A (en) 2020-12-08
CN112052676B (en) 2021-09-07

Family

ID=73608047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010897035.5A Expired - Fee Related CN112052676B (en) 2020-08-31 2020-08-31 Text content processing method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112052676B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253983A (en) * 2011-06-28 2011-11-23 北京新媒传信科技有限公司 Method and system for identifying Chinese high-risk words
JP2015011563A (en) * 2013-06-28 2015-01-19 カシオ計算機株式会社 Word conjugated form output device, program, electronic equipment including dictionary function and word conjugated form output system
CN110543637A (en) * 2019-09-06 2019-12-06 知者信息技术服务成都有限公司 Chinese word segmentation method and device
CN110928931A (en) * 2020-02-17 2020-03-27 深圳市琦迹技术服务有限公司 Sensitive data processing method and device, electronic equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766236A (en) * 2021-03-10 2021-05-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium
CN112766236B (en) * 2021-03-10 2023-04-07 拉扎斯网络科技(上海)有限公司 Text generation method and device, computer equipment and computer readable storage medium
CN113515605A (en) * 2021-05-20 2021-10-19 河南光悦网络科技有限公司 Intelligent robot question-answering method based on artificial intelligence and intelligent robot
CN113515605B (en) * 2021-05-20 2023-12-19 中晨田润实业有限公司 Intelligent robot question-answering method based on artificial intelligence and intelligent robot

Also Published As

Publication number Publication date
CN112052676B (en) 2021-09-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210907)