CN115455987A - Character grouping method based on character frequency and word frequency, storage medium and electronic equipment - Google Patents

Character grouping method based on character frequency and word frequency, storage medium and electronic equipment Download PDF

Info

Publication number
CN115455987A
CN115455987A
Authority
CN
China
Prior art keywords
characters
character
state transition
word frequency
transition matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211416941.4A
Other languages
Chinese (zh)
Other versions
CN115455987B (en)
Inventor
田辉
朱鹏远
鲁国峰
郭玉刚
张志翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei High Dimensional Data Technology Co ltd
Original Assignee
Hefei High Dimensional Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei High Dimensional Data Technology Co ltd filed Critical Hefei High Dimensional Data Technology Co ltd
Priority to CN202211416941.4A priority Critical patent/CN115455987B/en
Publication of CN115455987A publication Critical patent/CN115455987A/en
Application granted granted Critical
Publication of CN115455987B publication Critical patent/CN115455987B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/44Statistical methods, e.g. probability models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Optimization (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a character grouping method based on character frequency and word frequency, a storage medium and electronic equipment. The character grouping method comprises the following steps: traversing the corpus, and calculating the probability of each of the N characters to be grouped and the probability of each word composed of those characters; calculating a state transition matrix from the character and word probabilities; normalizing the state transition matrix to obtain a normalized state transition matrix; traversing the characters one by one, calculating the weight of assigning the character c to be allocated to each group and adding c to the group with the largest weight, the weight being positively correlated with the expected value of the number of groups contained in a random two-character sequence (bigram); and repeating until all characters are grouped. Through the weight calculation formula, the weight increases when two characters that often appear together are placed in different groups, so selecting the group with the largest weight separates co-occurring characters as far as possible and achieves a reasonable grouping; moreover, this grouping mode does not constrain the number of characters in each group, making the grouping more reasonable.

Description

Character grouping method based on character frequency and word frequency, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of invisible font-library watermarks, and in particular to a character grouping method based on character frequency and word frequency, a storage medium and electronic equipment.
Background
In existing text watermarking technology, in order to improve the robustness of watermarking algorithms against malicious attacks such as printing and scanning, screen capture and screen photographing, text digital watermarking based on modification of character topology has become mainstream. Specific characters are deformed in different ways, the deformation data corresponding to different watermark information bit strings are stored in a dedicated watermark font library, and watermark information is embedded through font substitution when electronic text documents are printed or displayed on screen. When different character deformation data are used for different users, the specific watermark font library constitutes a secure font library for that user.
Existing secure font libraries have many defects. To solve problems in the prior art such as poor universality of watermark loading, poor system stability, complex implementation and low robustness of the watermark algorithm, without changing any of the user's habits, the patent "Universal text watermarking method and device" (publication number CN114708133A), filed by Beijing National Crypto-Technology Co., Ltd., discloses the following scheme. A universal text watermarking method comprising the steps of: grouping a certain number of characters in the selected font library according to a specific strategy; applying a deformation design to all characters in each group according to a specific rule and generating a temporary watermark character data file; generating watermark coding data for the user terminal to identify its identity authentication information; dynamically generating and loading a watermark font file in real time from the watermark coding data, combined with the temporary watermark character data file and the grouped characters; and running the text file in electronic format, using the watermark font file to embed watermark information in real time in the document content printed from the file or displayed on screen.
Characters need to be grouped in this scheme. In theory, when grouping, characters with higher character frequency should be placed in different groups, and characters that often appear together should likewise be placed in different groups. A secure font library generated to meet these two requirements needs less text when the security code is extracted, giving better extraction results and accuracy. The character grouping method in that scheme has several defects. First, the number of characters in each group is substantially equal, which conflicts with the above requirements. Second, only character frequency is considered during grouping, while word frequency is not fully considered; in theory, the characters of frequently occurring words should be placed in different groups, so that more groups appear in a shorter piece of content and less content is needed when the security code is extracted. Third, the calculation process for optimizing the grouping in that scheme is too complex and consumes a large amount of time and computing power.
Disclosure of Invention
The invention aims to provide a character grouping method based on character frequency and word frequency that can group characters more reasonably.
In order to achieve this purpose, the invention adopts the following technical scheme: a character grouping method based on character frequency and word frequency, comprising the following steps: traversing the corpus, and calculating the probability $p_i$ of each character according to the frequency of occurrence of the N characters to be grouped; segmenting all texts in the corpus into words, and calculating the probability $p_w$ of each word composed of the N characters according to its frequency of occurrence; calculating from $p_i$ and $p_w$ the probability that one character is immediately followed by another, obtaining the state transition matrix $M$; normalizing $M$ so that the probabilities of all characters following a given character sum to 1, obtaining the normalized state transition matrix $\hat{M}$; traversing the characters one by one, calculating the weight of assigning the character c to be allocated to each group and adding c to the group with the largest weight, where the weight is positively correlated with the expected value of the number of groups contained in a random character bigram; and repeating until all characters are grouped.
Compared with the prior art, the invention has the following technical effects. The grouping scheme groups characters mainly from the associations between words, distributing the characters that often appear together as a word into different groups as far as possible. The probability that one character is followed by another is captured by the state transition matrix, and the weight calculation formula increases the weight when two characters that often appear together are placed in different groups, so selecting the group with the largest weight keeps co-occurring characters in different groups as far as possible and achieves a reasonable grouping. Moreover, this grouping mode does not constrain the number of characters in each group, making the grouping more reasonable.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention is described in further detail below with reference to fig. 1.
Referring to fig. 1, the invention discloses a character grouping method based on character frequency and word frequency, comprising the following steps. Traverse the corpus and calculate the probability $p_i$ of each character according to the frequency of occurrence of the N characters to be grouped. The preferred range of N is 1000 to 3000; the N characters with the highest character frequency are selected by sorting the characters by frequency. There are many word segmentation models; we choose a mature word segmentation model to segment all texts in the corpus into words, and calculate the probability $p_w$ of each word composed of the N characters according to its frequency of occurrence. The character and word frequencies can be computed with an existing corpus and model, or previously computed results can be adopted directly. The corpus can be chosen according to the user's needs: a general corpus, or the internal corpus of an enterprise or organization; different corpora yield different character groupings.
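As a sketch of this step, the character probabilities $p_i$ and word probabilities $p_w$ can be estimated by simple counting. The tiny pre-segmented corpus below is an assumption for illustration; in practice the segmentation would come from a real word segmentation model, and the counts from a full corpus.

```python
from collections import Counter

# Hypothetical pre-segmented corpus (each inner list is one segmented sentence).
segmented_corpus = [
    ["中国", "人民", "中国"],
    ["人民", "民主"],
]

# Character probabilities p_i from character frequencies.
char_counts = Counter(ch for sent in segmented_corpus for w in sent for ch in w)
total_chars = sum(char_counts.values())
p_char = {ch: n / total_chars for ch, n in char_counts.items()}

# Word probabilities p_w from word frequencies.
word_counts = Counter(w for sent in segmented_corpus for w in sent)
total_words = sum(word_counts.values())
p_word = {w: n / total_words for w, n in word_counts.items()}
```

In a real setting, the N most frequent characters would then be kept by sorting `p_char` in descending order, as the description above prescribes.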
From $p_i$ and $p_w$, the probability that one character is immediately followed by another is calculated to obtain the state transition matrix $M$. The number of rows and columns of the matrix equals the number of characters N, and the element $m_{ij}$ of $M$ represents the probability that character $c_i$ is immediately followed by character $c_j$; constructing $M$ thus establishes the character-to-character relationships. Specifically, $m_{ij}$ can be calculated according to the following formula:

$$m_{ij} = \sum_{w \in W_{ij}} p_w$$

where the right-hand side is the sum of the probabilities of the specific words in which characters $c_i$ and $c_j$ are adjacent and in that order. That is, the word set $W_{ij}$ contains words of the form $c_i c_j$, $c_i c_j x$ or $x c_i c_j$, in which $c_i$ comes first, $c_j$ comes second and the two characters are adjacent; it does not contain words such as $c_i x c_j$ or $c_j c_i$. Since a long word containing other words can be split apart during word segmentation, the summation is needed; and since consecutive characters that do not form a word are ignored, the computed state transition matrix $M$
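The construction of $M$ described above can be sketched as follows. The character list and word probabilities are toy assumptions carried over from the counting example, not data from the patent:

```python
# Build the state transition matrix M: m[i][j] accumulates p_w over every word
# w in which character i is immediately followed by character j.
chars = ["中", "国", "人", "民", "主"]
idx = {c: k for k, c in enumerate(chars)}
p_word = {"中国": 0.4, "人民": 0.4, "民主": 0.2}  # toy word probabilities

N = len(chars)
M = [[0.0] * N for _ in range(N)]
for w, pw in p_word.items():
    for a, b in zip(w, w[1:]):          # adjacent, ordered character pairs
        if a in idx and b in idx:
            M[idx[a]][idx[b]] += pw     # a long word contributes once per pair
```

Note that ordered adjacency is what matters: the pair (国, 中) gets no mass from the word 中国, matching the exclusion of words of the form $c_j c_i$ above.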
Further, the state transition matrix $M$ is normalized so that the probabilities of all characters following a given character sum to 1, yielding the normalized state transition matrix $\hat{M}$. A state transition matrix uniquely determines a Markov chain, so once this matrix is obtained, the modelling from corpus to language model is complete. Specifically, the elements of $M$ whose value is 0 are reset as follows:

$$\hat{m}_{ij} = \frac{(1 - S_i)\,p_j}{Q_i}, \qquad \text{for } m_{ij} = 0,$$

where $S_i$ is the sum of all elements in row $i$ of $M$, and $Q_i$ is the sum of the character probabilities $p_j$ of the characters whose elements in row $i$ are 0; the non-zero elements are kept unchanged, $\hat{m}_{ij} = m_{ij}$. If a character does not form a word with any other character, all elements of its row in $M$ are 0, and after normalization the values in that row are the character probabilities.
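A minimal sketch of this normalization rule follows, under the reconstruction above: the zero entries of row $i$ receive the remaining probability mass $(1 - S_i)$ in proportion to the character probabilities. The inputs are toy assumptions.

```python
# Normalize M row by row so that each row of M_hat sums to 1.
chars = ["中", "国", "人", "民", "主"]
p_char = {"中": 0.2, "国": 0.2, "人": 0.2, "民": 0.3, "主": 0.1}
M = [[0.0] * 5 for _ in range(5)]
M[0][1] = 0.4; M[2][3] = 0.4; M[3][4] = 0.2   # toy transition masses

N = len(chars)
M_hat = [row[:] for row in M]                  # non-zero entries unchanged
for i in range(N):
    S_i = sum(M[i])                            # mass already assigned in row i
    zero_cols = [j for j in range(N) if M[i][j] == 0.0]
    Q_i = sum(p_char[chars[j]] for j in zero_cols)  # mass of unseen successors
    for j in zero_cols:
        M_hat[i][j] = (1.0 - S_i) * p_char[chars[j]] / Q_i
```

Row 1 (the character 国) forms no word here, so $S_1 = 0$ and its normalized row reduces to the character probabilities, as the description states.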
Once the normalized state transition matrix $\hat{M}$ is obtained, consider the following scenario in order to group the characters well: all characters in the character set have already been grouped, and a new character c now needs to be assigned, so only the best group for c needs to be computed. Repeating this idea, after the best group has been computed for each character in turn, the resulting grouping is the optimal grouping of the N characters. How to determine the best group for a given character is settled by introducing a weight.
First, to measure the grouping effect, define G as the expected value, under the language model, of the number of groups contained in a random character bigram. For N grouped characters, G is calculated as:

$$G = \sum_{i=1}^{N}\sum_{j=1}^{N} p_i\,\hat{m}_{ij}\,g_{ij}$$

where $g_{ij}$ denotes the number of different groups contained in the bigram $c_i c_j$: when $c_i$ and $c_j$ are in the same group, $g_{ij} = 1$; when they are in different groups, $g_{ij} = 2$. The product $p_i\,\hat{m}_{ij}$ is the probability that character $c_i$ is followed by character $c_j$, i.e. the probability of the bigram $c_i c_j$. From the definition of G it follows that the larger $p_i\,\hat{m}_{ij}$ is, the more G increases when $c_i$ and $c_j$ are placed in different groups. Therefore we only need to calculate the value of G with the character c to be assigned placed in each group in turn; the larger the G value, the better the grouping effect.
Therefore, in the first embodiment of the invention, the expected value G of the number of groups in a random bigram is used directly as the weight. Specifically, in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = \sum_{c_i \in A}\sum_{c_j \in A} p_i\,\hat{m}_{ij}\,g_{ij}$$

where A is the set consisting of the already grouped characters and the character c to be assigned, c is placed in $S_k$ when evaluating $g_{ij}$, and $\hat{m}_{ij}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character $c_i$ and the column of character $c_j$. In this embodiment, each time a character is assigned, the G value with that character placed in each group is calculated.
As more and more characters are grouped, this calculation grows heavier. To increase processing speed, we change the idea and find the best group by calculating the increase in G. When the character c to be assigned is placed in the k-th group $S_k$, the increase in G is:

$$\Delta G_k = 2\sum_{c_i \in A'} p_i\,\hat{m}_{ic} + 2\sum_{c_i \in A'} p_c\,\hat{m}_{ci} - \sum_{c_i \in S_k}\left(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci}\right)$$

where $A'$ denotes the already grouped characters (a constant self-transition term $p_c\,\hat{m}_{cc}$, likewise independent of k, is omitted). The first two terms are independent of k, i.e. independent of the grouping method. From this derivation, the weight can be defined in two ways.

In the second embodiment, in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = \sum_{c_i \in A',\; c_i \notin S_k}\left(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci}\right)$$

where $\hat{m}_{ci}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character c and the column of character $c_i$.
In the third embodiment, in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = -\sum_{c_i \in S_k}\left(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci}\right)$$

where $\hat{m}_{ci}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character c and the column of character $c_i$.
The number of groups K is chosen as required, for example K = 30. In that case the number of summation terms in the second embodiment's weight (a sum over the characters outside $S_k$) is much greater than in the third embodiment's weight (a sum over the characters inside $S_k$ only); that is, the computation in the second embodiment is much less than in the first embodiment, but still more than in the third. Therefore, in practice we preferably adopt $-\sum_{c_i \in S_k}(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci})$ as the weight.

From the above description it can be seen that the expected value G itself, the increase $\Delta G_k$, and the simplified weight of the third embodiment are all positively correlated with the expected value of the number of groups in a random character bigram. Besides the three weights given here, other weights may be used, as long as they are positively correlated with G.
Further, in the step of traversing the characters one by one and calculating the weights of assigning the character c to all groups, the characters are traversed in order of character frequency from high to low. Assigning each character c solves a local optimum; traversing the characters from high to low character frequency then yields the overall optimum, i.e. the optimal grouping after all characters have been assigned.
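The overall greedy loop, using the preferred third-embodiment weight and the high-to-low frequency order, might look like the following sketch. All names and the toy inputs are assumptions, not the patent's code; ties between groups are broken by taking the first group with the maximal weight.

```python
def group_characters(chars, p, M_hat, K):
    """Greedy grouping: assign each character (in descending character
    frequency) to the group k maximizing
    W_k = -sum over members a of S_k of (p[a]*m_hat[a][c] + p[c]*m_hat[c][a])."""
    order = sorted(range(len(chars)), key=lambda i: p[i], reverse=True)
    groups = [[] for _ in range(K)]
    assignment = {}
    for c in order:
        weights = [
            -sum(p[a] * M_hat[a][c] + p[c] * M_hat[c][a] for a in groups[k])
            for k in range(K)
        ]
        best = max(range(K), key=lambda k: weights[k])
        groups[best].append(c)
        assignment[chars[c]] = best
    return assignment
```

On a toy pair of characters that always follow one another, the method puts them in different groups, which is exactly the behavior the description argues for.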
The invention also discloses a computer readable storage medium and an electronic device. Specifically: a computer readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the character grouping method based on character frequency and word frequency described above; and an electronic device comprising a memory, a processor and a computer program stored on the memory, the processor implementing the character grouping method based on character frequency and word frequency when executing the computer program.

Claims (10)

1. A character grouping method based on character frequency and word frequency, characterized by comprising the following steps:
traversing the corpus, and calculating the probability $p_i$ of each character according to the frequency of occurrence of the N characters to be grouped;
segmenting all texts in the corpus into words, and calculating the probability $p_w$ of each word composed of the N characters according to its frequency of occurrence;
calculating from $p_i$ and $p_w$ the probability that one character is immediately followed by another, obtaining the state transition matrix $M$;
normalizing $M$ so that the probabilities of all characters following a given character sum to 1, obtaining the normalized state transition matrix $\hat{M}$;
traversing the characters one by one, calculating the weight of assigning the character c to be allocated to each group and adding c to the group with the largest weight, the weight being positively correlated with the expected value of the number of groups contained in a random character bigram; and repeating until all characters are grouped.
2. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that the expected value of the number of groups contained in a random character bigram after the N characters are grouped is calculated by the following formula:

$$G = \sum_{i=1}^{N}\sum_{j=1}^{N} p_i\,\hat{m}_{ij}\,g_{ij}$$

where $g_{ij}$ represents the number of different groups contained in the bigram, and $\hat{m}_{ij}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character $c_i$ and the column of character $c_j$.
3. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that the element $m_{ij}$ of the state transition matrix $M$, representing the probability that character $c_i$ is immediately followed by character $c_j$, is calculated according to the following formula:

$$m_{ij} = \sum_{w \in W_{ij}} p_w$$

where the summation is over the probabilities of the specific words in which characters $c_i$ and $c_j$ are adjacent and in that order.
4. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that normalizing the state transition matrix $M$ refers to resetting the elements of $M$ whose value is 0 as follows:

$$\hat{m}_{ij} = \frac{(1 - S_i)\,p_j}{Q_i}, \qquad \text{for } m_{ij} = 0,$$

where $S_i$ is the sum of all elements in row $i$ of $M$, and $Q_i$ is the sum of the character probabilities of the characters whose elements in row $i$ are 0.
5. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = \sum_{c_i \in A}\sum_{c_j \in A} p_i\,\hat{m}_{ij}\,g_{ij}$$

where A is the set of the grouped characters and the character c to be assigned, and $\hat{m}_{ij}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character $c_i$ and the column of character $c_j$.
6. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = \sum_{c_i \notin S_k}\left(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci}\right)$$

where the sum runs over the grouped characters $c_i$ not in $S_k$, and $\hat{m}_{ci}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character c and the column of character $c_i$.
7. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that in the step of calculating the weights of assigning the character c to all groups, the weight of assigning c to the k-th group $S_k$ is calculated as:

$$W_k = -\sum_{c_i \in S_k}\left(p_i\,\hat{m}_{ic} + p_c\,\hat{m}_{ci}\right)$$

where $\hat{m}_{ci}$ is the element of the normalized state transition matrix $\hat{M}$ at the row of character c and the column of character $c_i$.
8. The character grouping method based on character frequency and word frequency according to claim 1, characterized in that in the step of traversing the characters one by one and calculating the weights of assigning the character c to all groups, the characters are traversed in order of character frequency from high to low.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, implements the character grouping method based on character frequency and word frequency according to any one of claims 1 to 8.
10. An electronic device, characterized by comprising a memory, a processor and a computer program stored on the memory, the processor implementing the character grouping method based on character frequency and word frequency according to any one of claims 1 to 8 when executing the computer program.
CN202211416941.4A 2022-11-14 2022-11-14 Character grouping method based on character frequency and word frequency, storage medium and electronic equipment Active CN115455987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211416941.4A CN115455987B (en) 2022-11-14 2022-11-14 Character grouping method based on character frequency and word frequency, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211416941.4A CN115455987B (en) 2022-11-14 2022-11-14 Character grouping method based on character frequency and word frequency, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN115455987A true CN115455987A (en) 2022-12-09
CN115455987B CN115455987B (en) 2023-05-05

Family

ID=84295819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211416941.4A Active CN115455987B (en) Character grouping method based on character frequency and word frequency, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115455987B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6137911A (en) * 1997-06-16 2000-10-24 The Dialog Corporation Plc Test classification system and method
US20080109209A1 (en) * 2006-11-02 2008-05-08 University Of Southern California Semi-supervised training for statistical word alignment
CN106372640A (en) * 2016-08-19 2017-02-01 中山大学 Character frequency text classification method
CN107704455A (en) * 2017-10-30 2018-02-16 成都市映潮科技股份有限公司 A kind of information processing method and electronic equipment
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment
CN108415953A (en) * 2018-02-05 2018-08-17 华融融通(北京)科技有限公司 A kind of non-performing asset based on natural language processing technique manages knowledge management method
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning
CN110263325A (en) * 2019-05-17 2019-09-20 交通银行股份有限公司太平洋***中心 Chinese automatic word-cut
JP2019200784A (en) * 2018-05-09 2019-11-21 株式会社アナリティクスデザインラボ Analysis method, analysis device and analysis program
US20210067533A1 (en) * 2018-01-04 2021-03-04 Ping An Technology (Shenzhen) Co., Ltd. Network Anomaly Data Detection Method and Device as well as Computer Equipment and Storage Medium
CN113688615A (en) * 2020-05-19 2021-11-23 阿里巴巴集团控股有限公司 Method, device and storage medium for generating field annotation and understanding character string
CN114708133A (en) * 2022-01-27 2022-07-05 北京国隐科技有限公司 Universal text watermarking method and device


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cong Wei: "Research on a Mongolian word segmentation *** based on cascaded hidden Markov models", China Master's Theses Full-text Database, Information Science & Technology *
Sun Yiwei: "Research on a Bi-LSTM+CRF Chinese word segmentation method incorporating dictionary correction", China Master's Theses Full-text Database, Information Science & Technology *
Yi Yong: "Research on style discrimination and couplet response in computer-assisted poetry composition", China Doctoral Dissertations Full-text Database, Information Science & Technology *

Also Published As

Publication number Publication date
CN115455987B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN109977861A (en) Offline handwritten mathematical formula recognition method
CN108009253A (en) Improved character string similarity comparison method
CN114708133B (en) Universal text watermarking method and device
CN111931489B (en) Text error correction method, device and equipment
CN112016061A (en) Excel document data protection method based on robust watermarking technology
CN108960301A (en) Ancient Yi script recognition method based on convolutional neural networks
Gutub et al. Utilizing diacritic marks for Arabic text steganography
CN102402500A (en) Method and system for conversion of PDF (Portable Document Format) file into SWF (Shock Wave Flash) file
CN104050400B (en) A kind of web page interlinkage guard method that steganography is encoded based on command character
CN107220333B (en) character search method based on Sunday algorithm
CN115689853A (en) Robust text watermarking method based on Chinese character characteristic modification and grouping
CN111914825A (en) Character recognition method and device and electronic equipment
CN111488732A (en) Deformed keyword detection method, system and related equipment
CN112861844A (en) Service data processing method and device and server
CN101639828A (en) Method for hiding and extracting watermark based on XML electronic document
CN115618809A (en) Character grouping method based on binary character frequency and safe word stock construction method
CN115455987A (en) Character grouping method based on character frequency and word frequency, storage medium and electronic equipment
CN103136166B (en) Method and device for font determination
CN116975864A (en) Malicious code detection method and device, electronic equipment and storage medium
CN115455965B (en) Character grouping method based on word distance word chain, storage medium and electronic equipment
Khekan et al. New text steganography method using the Arabic letters dots
Rui et al. A multiple watermarking algorithm for texts mixed Chinese and English
CN115455955A (en) Chinese named entity recognition method based on local and global character representation enhancement
CN115455966B (en) Safe word stock construction method and safe code extraction method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant