CN112464642A - Method, device, medium and electronic equipment for adding punctuation to text - Google Patents

Info

Publication number
CN112464642A
Authority
CN
China
Prior art keywords
word
words
relation
text
dependent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011344671.1A
Other languages
Chinese (zh)
Inventor
颜泽龙
王健宗
吴天博
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011344671.1A priority Critical patent/CN112464642A/en
Publication of CN112464642A publication Critical patent/CN112464642A/en
Priority to PCT/CN2021/084169 priority patent/WO2021213155A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method, a device, a medium and electronic equipment for adding punctuation to a text. The method comprises: obtaining a text to be added and segmenting it to obtain a plurality of words; obtaining the relation between the words to obtain the dependent word of each word and the relation between each word and its dependent word; determining the relation vector of each word based on the word, its dependent word and the relation between them; obtaining the relation between the relation vectors of the plurality of words; and adding punctuation between the plurality of words based on the relation between the relation vectors. Because the relation between the words in the text to be added is taken into account, the accuracy of punctuation addition can be improved to a certain extent.

Description

Method, device, medium and electronic equipment for adding punctuation to text
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a medium, and an electronic device for adding punctuation to a text.
Background
With the continuous development of artificial intelligence, a wide range of deep learning applications has emerged. At present, both text generated by speech recognition and many social networking corpora are text without any punctuation. Because the necessary sentence boundary and punctuation information is missing, the readability of the text is low, and downstream natural language processing tasks, such as graph recognition and named entity recognition, are adversely affected. Existing punctuation adding methods require manually constructed features as input, do not consider the features of the text to be added, and are not accurate enough.
Disclosure of Invention
The application aims to provide a method, a device, a medium and electronic equipment for adding punctuations to texts, which can improve the accuracy of punctuation addition to a certain extent.
According to an aspect of an embodiment of the present application, a method for adding punctuation to a text is provided, including: acquiring a text to be added, and segmenting the text to be added to obtain a plurality of words; obtaining the relation between each word in the plurality of words to obtain the dependent words of each word and the relation between each word and the dependent words; determining a relation vector of each word based on each word, a dependent word of each word and a relation between each word and the dependent word thereof; obtaining the relation among the relation vectors of the words; punctuation is added between the plurality of words based on relationships between the relationship vectors.
According to an aspect of the embodiments of the present application, there is provided a text punctuation adding apparatus, including: the acquisition module is configured to acquire a text to be added, and perform word segmentation on the text to be added to obtain a plurality of words; obtaining the relation between each word in the plurality of words to obtain the dependent words of each word and the relation between each word and the dependent words; a determining module configured to determine a relationship vector of each word based on each word, a dependent word of each word, and a relationship between each word and its dependent word; and the adding module is configured to acquire the relation among the relation vectors of the plurality of words and add punctuation among the plurality of words based on the relation among the relation vectors.
In some embodiments of the present application, based on the foregoing solution, the obtaining module is configured to: performing word segmentation on the text to be added according to the text sequence to obtain a first word segmentation result; segmenting the text to be added according to the reverse order of the text to obtain a second segmentation result; obtaining the difference between the first segmentation result and the second segmentation result, and segmenting the text to be added corresponding to the difference from the middle to two sides to obtain a difference result; and replacing the difference between the first segmentation result and the second segmentation result with the difference result, and taking the replaced first segmentation result as the plurality of words.
In some embodiments of the present application, based on the foregoing solution, the obtaining module is configured to: acquiring the part of speech and the position of each word; and determining the relation among the words according to the parts of speech and the positions of the words.
In some embodiments of the present application, based on the foregoing, the determining module is configured to: acquiring a first vector obtained based on each word; obtaining a second vector obtained based on the dependent words of each word; obtaining a third vector obtained based on the relation between each word and the dependent word thereof; and combining the first vector, the second vector and the third vector to obtain a relation vector corresponding to each word.
In some embodiments of the present application, based on the foregoing, the determining module is configured to: coding each word to obtain a first sequence; encoding the dependent words of each word to obtain a second sequence; coding the relation between each word and the dependent word thereof to obtain a third sequence; truncating or zero-filling the first sequence, the second sequence and the third sequence, mapping the truncated or zero-filled first sequence to the first vector, mapping the truncated or zero-filled second sequence to the second vector, and mapping the truncated or zero-filled third sequence to the third vector.
In some embodiments of the present application, based on the foregoing, the adding module is configured to: and inputting the relation vectors of the plurality of words into a pre-trained attention model to obtain the relation between the relation vectors of the plurality of words.
In some embodiments of the present application, based on the foregoing, the adding module is configured to: adding punctuation among the plurality of words to obtain a plurality of adding modes; extracting the features of the relationship between the relationship vectors through a bidirectional LSTM layer; based on the characteristics, calculating the probability of various adding modes by utilizing a Viterbi algorithm, and adding punctuation among the words based on the adding mode with the highest probability in the multiple modes.
According to an aspect of embodiments of the present application, there is provided a computer-readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method of any one of the above.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method of any of the above.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the technical scheme provided by some embodiments of the application, the text to be added is obtained, the text to be added is segmented to obtain a plurality of words, the relation between each word in the plurality of words is obtained, the dependent word of each word and the relation between each word and the dependent word are obtained, the relation vector of each word is determined based on each word, the dependent word of each word and the relation between each word and the dependent word, the relation between the relation vectors of the plurality of words is obtained, punctuation is added among the plurality of words based on the relation between the relation vectors, the relation between words in the text to be added is considered, and the accuracy of punctuation addition can be improved to a certain extent.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates an exemplary system architecture diagram to which aspects of embodiments of the present application may be applied;
FIG. 2 schematically illustrates a flow chart of a text punctuation method of one embodiment of the present application;
FIG. 3 schematically shows a structural diagram of a system for text punctuation according to an embodiment of the present application;
FIG. 4 schematically illustrates a block diagram of a text punctuation adding apparatus according to an embodiment of the present application;
FIG. 5 is a hardware diagram of an electronic device shown in accordance with an exemplary embodiment;
FIG. 6 illustrates a computer-readable storage medium for implementing a method according to an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include a terminal device 101 (which may be one or more of a smartphone, a tablet computer, and a portable computer, and certainly may be a desktop computer, etc.), a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired communication links, wireless communication links, and so forth.
It should be understood that the number of terminal devices 101, networks 102, and servers 103 in fig. 1 is merely illustrative. There may be any number of terminal devices 101, networks 102, and servers 103, as desired for implementation. For example, the server 103 may be a server cluster composed of a plurality of servers.
In an embodiment of the application, the server 103 obtains a text to be added, performs word segmentation on the text to be added to obtain a plurality of words, obtains a relationship between each word in the plurality of words, obtains a dependent word of each word and a relationship between each word and the dependent word thereof, determines a relationship vector of each word based on each word, the dependent word of each word and the relationship between each word and the dependent word thereof, obtains a relationship between relationship vectors of the plurality of words, adds punctuation between the plurality of words based on the relationship between the relationship vectors, considers the relationship between words in the text to be added, and can improve accuracy of punctuation addition to a certain extent.
It should be noted that the text punctuation adding method provided in the embodiment of the present application is generally executed by the server 103, and accordingly, the text punctuation adding device is generally disposed in the server 103. However, in other embodiments of the present application, the terminal device 101 may also have a similar function as the server 103, so as to execute the text punctuation adding method provided in the embodiments of the present application.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 schematically shows a flowchart of a text punctuation adding method according to an embodiment of the present application, where an execution subject of the text punctuation adding method may be a server, such as the server 103 shown in fig. 1.
Referring to fig. 2, the text punctuation adding method at least includes steps S210 to S250, which are described in detail as follows:
in step S210, a text to be added is obtained, and the text to be added is subjected to word segmentation to obtain a plurality of words.
In an embodiment of the application, the text to be added can be segmented according to the text sequence to obtain a first segmentation result; segmenting the text to be added according to the reverse order of the text to obtain a second segmentation result; acquiring the difference between the first segmentation result and the second segmentation result, and segmenting the text to be added corresponding to the difference from the middle to two sides to obtain a difference result; and replacing the difference between the first segmentation result and the second segmentation result with a difference result, and taking the replaced first segmentation result as a plurality of words.
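The forward and reverse segmentation passes above can be sketched as follows. This is a minimal illustration only: it assumes dictionary-based greedy maximum matching, which the description does not specify, and the names `forward_max_match`, `backward_max_match`, `disputed_spans` and the toy vocabulary are hypothetical. It stops at locating the spans where the two results differ, because the description does not detail how the differing span is re-segmented from the middle to two sides.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right pass: take the longest vocabulary word at each step."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # single characters always match
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len=4):
    """Greedy right-to-left pass, returned in reading order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def boundaries(words):
    """Character offsets at which a word ends."""
    ends, pos = set(), 0
    for word in words:
        pos += len(word)
        ends.add(pos)
    return ends


def disputed_spans(first, second):
    """Character spans (start, end) where the two segmentations disagree."""
    f_ends, s_ends = boundaries(first), boundaries(second)
    spans, start = [], 0
    for end in sorted(f_ends & s_ends):
        inner_f = {b for b in f_ends if start < b < end}
        inner_s = {b for b in s_ends if start < b < end}
        if inner_f != inner_s:
            spans.append((start, end))
        start = end
    return spans


# Toy vocabulary (an assumption made for this example only).
vocab = {"研究", "研究生", "生命", "起源"}
text = "研究生命起源"
first = forward_max_match(text, vocab)    # first segmentation result
second = backward_max_match(text, vocab)  # second segmentation result
spans = disputed_spans(first, second)
```

In this toy case the two passes agree on the last word and disagree on the first four characters, so only that span would need to be re-segmented and substituted into the first result.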
In one embodiment of the application, the nonsense characters in the text to be added can be segmented after filtering.
In an embodiment of the present application, each character in the text to be added may be identified and combined with the characters adjacent to it, and the combinations are then checked against a preset word list so that the combinations found in the preset word list are taken as the segmented words.
In one embodiment of the present application, the sense of each character may be obtained, and if the senses of adjacent characters can be combined, the character and its adjacent characters are taken as one word.
In an embodiment of the present application, a text to be added may be input into a pre-trained segmentation model, so as to obtain a plurality of words output by the segmentation model.
In step S220, the relationship between each word in the plurality of words is obtained, and the dependent word of each word and the relationship between each word and its dependent word are obtained.
In one embodiment of the application, the part of speech and the position of each word can be obtained; and determining the relation among the words according to the parts of speech and the positions of the words.
In one embodiment of the present application, word senses of each word may be obtained, and relationships between a plurality of words may be determined based on the word senses.
In an embodiment of the present application, for each word in the plurality of words, the word relationship table may be searched according to the word and any word in the plurality of words, so as to obtain a dependent word having an association with the word and a relationship between the word and the dependent word.
In one embodiment of the present application, a plurality of words may be input into a pre-trained relationship acquisition model, and a relationship between the plurality of words output by the relationship acquisition model is obtained.
In one embodiment of the present application, the relationship acquisition model may be a syntactic dependency tree model.
In one embodiment of the present application, the dependency relationship may include: master-slave relationship, verb-object relationship, passive relationship, subordinate relationship, fixed collocation, homonym, adjective, etc.
In one embodiment of the present application, a label may be set for each dependency to facilitate the generation of vectors from the labels in the following.
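As a rough illustration of how labelled dependencies could be turned into per-word sequences, the sketch below maps each word to its dependent word and to a numeric relation label. The label names, the numeric ids and the helper `integrate` are assumptions made for this example; the root handling (the root word takes itself as its related word and receives a separate root label) follows the convention given in the description.

```python
# Hypothetical label inventory; id 0 is left free so that it can serve as a padding value.
RELATION_IDS = {"root": 1, "subject": 2, "object": 3, "modifier": 4}


def integrate(words, heads, relations):
    """Turn per-word dependency arcs into a dependent-word sequence and a relation-id sequence.

    heads[i] is the index of the word that words[i] depends on, or -1 for the root.
    """
    parents, rel_ids = [], []
    for word, head, rel in zip(words, heads, relations):
        # The root word has no head, so it is paired with itself.
        parents.append(word if head < 0 else words[head])
        rel_ids.append(RELATION_IDS[rel])
    return parents, rel_ids


# Illustrative English parse (assumed, not taken from the application).
words = ["doctors", "treat", "patients"]
heads = [1, -1, 1]
relations = ["subject", "root", "object"]
parents, rel_ids = integrate(words, heads, relations)
```

The three aligned sequences (`words`, `parents`, `rel_ids`) are the inputs from which a word vector, a dependent-word vector and a relation vector could later be built.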
In step S230, a relationship vector of each word is determined based on each word, the dependent word of each word, and the relationship between each word and its dependent word.
In an embodiment of the present application, a first vector obtained based on each word may be obtained; acquiring a second vector obtained based on the dependent words of each word; obtaining a third vector obtained based on the relation between each word and the dependent word thereof; and combining the first vector, the second vector and the third vector to obtain a corresponding relation vector of each word.
In one embodiment of the present application, each word may be encoded to obtain a first sequence; coding the dependent words of each word to obtain a second sequence; coding the relation between each word and the dependent word thereof to obtain a third sequence; and truncating or zero padding the first sequence, the second sequence and the third sequence, mapping the truncated or zero-padded first sequence into a first vector, mapping the truncated or zero-padded second sequence into a second vector, and mapping the truncated or zero-padded third sequence into a third vector.
In one embodiment of the present application, the first sequence, the second sequence, and the third sequence may be truncated from front to back.
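The encoding and length normalisation step might look like the sketch below. It assumes integer vocabulary ids with 0 reserved for zero padding and 1 for unknown items; `encode`, `fit_length` and the sample vocabulary are hypothetical names introduced here, and truncation keeps the leading items of an over-long sequence.

```python
PAD_ID, UNK_ID = 0, 1  # assumed reserved ids


def encode(items, vocab):
    """Map each item to its integer id, falling back to UNK_ID."""
    return [vocab.get(item, UNK_ID) for item in items]


def fit_length(ids, n):
    """Truncate to the first n ids, or right-pad with zeros up to length n."""
    return ids[:n] + [PAD_ID] * max(0, n - len(ids))


word_vocab = {"doctors": 2, "treat": 3, "patients": 4}  # illustrative only
first_sequence = fit_length(encode(["doctors", "treat", "patients"], word_vocab), 5)
too_long = fit_length([2, 3, 4, 5, 6, 7], 4)
```

The same two helpers would be applied to the dependent-word sequence and to the relation sequence, so that all three sequences end up with the same length before they are mapped to vectors.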
In step S240, the relationship between the relationship vectors of the plurality of words is acquired.
In one embodiment of the present application, the relationship vectors of the plurality of words may be input into a pre-trained attention model, to obtain the relationship between the relationship vectors of the plurality of words, and the pre-trained attention model may fully take into account the relationship between each of the relationship vectors.
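The description does not state which form of attention is used. As one possible reading, the sketch below applies scaled dot-product self-attention directly to the relation vectors, without the learned projection matrices that a trained model would normally include; `softmax` and `self_attention` are names chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (outputs, weights): each output is a weighted mix of all input vectors."""
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Similarity of this vector to every vector, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in vectors]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(row[j] * vectors[j][d] for j in range(len(vectors)))
                        for d in range(dim)])
    return outputs, weights


# Three toy 2-dimensional relation vectors (illustrative values).
outputs, weights = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Each row of `weights` sums to 1 and indicates how strongly one position attends to every other position, which is one way of expressing the relation between the relation vectors.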
In step S250, punctuation is added between the plurality of words based on the relationship between the relationship vectors.
In one embodiment of the present application, punctuation may be added between words by conditional random fields.
In one embodiment of the application, punctuation can be added among a plurality of words to obtain a plurality of addition modes; extracting the characteristics of the relation between the relation vectors through a bidirectional LSTM layer; based on the characteristics, calculating the probability of various adding modes by utilizing a Viterbi algorithm, and adding punctuation among a plurality of words based on the adding mode with the highest probability in a plurality of modes.
In this embodiment, the bidirectional LSTM layer can perform deeper feature extraction on the text, resulting in a feature output matrix of size N × K for the text, where K is the number of neurons in the LSTM layer.
For example, suppose that in a certain scene there are three predicted punctuation categories: no punctuation, comma, and period. For the prediction text [ small doctor is plumes ], 5 positions in total need to be predicted and each position has 3 possible cases, so there are 3^5 = 243 possible prediction results in total. If [ no punctuation, no punctuation, no punctuation, no punctuation, period ] is the result with the maximum probability value, the final prediction result is [ small doctor is plumes. ]
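The search over all labelings in the example above can be sketched with a small Viterbi decoder. The emission and transition scores below are made-up numbers standing in for the outputs of a trained network and conditional random field, and they are additive (log-domain) scores rather than probabilities; `viterbi`, `path_score` and `LABELS` are names introduced for this illustration. The brute-force enumeration is included only to show that dynamic programming finds the same best path without visiting every combination.

```python
from itertools import product

LABELS = ["none", "comma", "period"]


def path_score(path, emissions, transitions):
    """Total score of one complete label sequence."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def viterbi(emissions, transitions):
    """Return (best_path, best_score) using dynamic programming."""
    n_labels = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each label
    back = []                  # back-pointers for each later position
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for cur in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][cur])
            new_best.append(best[prev] + transitions[prev][cur] + emissions[t][cur])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    last = max(range(n_labels), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Assumed scores: 5 positions x 3 labels, and a 3 x 3 transition table.
emissions = [
    [2.0, 0.1, 0.0],
    [1.5, 0.3, 0.0],
    [1.8, 0.2, 0.1],
    [1.2, 0.9, 0.2],
    [0.1, 0.3, 2.5],
]
transitions = [
    [0.2, 0.0, 0.0],
    [0.1, -1.0, -1.0],
    [0.0, -1.0, -2.0],
]

best_path, best_score = viterbi(emissions, transitions)
predicted = [LABELS[k] for k in best_path]

# Brute force over all 3^5 labelings, for comparison only.
all_paths = list(product(range(len(LABELS)), repeat=len(emissions)))
brute_score = max(path_score(p, emissions, transitions) for p in all_paths)
```

With these scores the decoder selects no punctuation at the first four positions and a period at the last one, matching the shape of the example in the text.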
In the embodiment of fig. 2, the text to be added is obtained, the text to be added is segmented to obtain a plurality of words, the relationship between each word in the plurality of words is obtained, the dependent word of each word and the relationship between each word and the dependent word thereof are obtained, the relationship vector of each word is determined based on each word, the dependent word of each word and the relationship between each word and the dependent word thereof, the relationship between the relationship vectors of the plurality of words is obtained, punctuation is added between the plurality of words based on the relationship between the relationship vectors, the relationship between words in the text to be added is considered, and the accuracy of punctuation addition can be improved to a certain extent.
According to the method for adding punctuation to the text, punctuation symbols are added to the Chinese text lacking sentence boundary information, necessary sentence structure information is supplemented, the readability of the text is improved, and the effect of downstream natural language processing tasks is further improved.
In an embodiment of the present application, the present application provides a system for adding punctuation to text, where the system for adding punctuation to text processes text to be added with punctuation using the method for adding punctuation of text of the present application, and fig. 3 schematically illustrates a structural diagram of the system for adding punctuation to text of the present application, and as shown in fig. 3, the system for adding punctuation to text may include an Input module (Input), a syntactic Dependency tree module (Dependency tree), a merge module (con), an Attention module (Attention), a feature extraction module (BiLSTM), a conditional random field module (CRF), and an Output module (Output).
The process of processing the medical information text by applying the text punctuation adding method of the present application may include the following. The input module receives a medical information text, whose length in characters may be l1. The syntactic dependency tree module segments the text into words, yielding a sequence in units of words with a length of l2; the length of the sequence in words may be shorter than the length of the text in characters. The syntactic relation of the whole sentence is then extracted, so that the dependent word of every word and the relation between every word and its dependent word can be obtained; by integrating the obtained syntactic relations, the related dependent word of each word and the corresponding semantic relation are obtained.
For example: word segmentation predicts a tag for each position, such as [ B E S B E S B E ]; integrating the tags yields a word-level text sequence [ B ] of length 5, and the syntactic dependency tree yields 5 position triples in total, such as (B, doctor, 1), (B, doctor, 2), (doctor, 3), (B, 4), (B, 5). By integrating the triples, the corresponding related dependent word sequence (doctor is yes) and the corresponding semantic relation sequence (12345) can be obtained, where each number represents one type of semantic relation; there are dozens of specific semantic relations in total, including master-slave relations, passive relations, fixed collocation and the like. If the word at the current position is at the root of the syntactic dependency tree, its related word is itself, as with [ yes ] in the example, and the corresponding relation is additionally marked as [ root ].
In an embodiment of the present application, the process of processing the medical information text by applying the text punctuation method of the present application may further include: obtaining semantic vectors according to semantic relations (refer to the steps of obtaining a first vector, a second vector and a third vector above), carrying out length standardization on the semantic vectors, setting the standard length as N, carrying out truncation when the length exceeds N, only reserving the first N words, carrying out zero padding when the length is less than N, obtaining three sequences with the length of N, obtaining a first vector (Word Emb) according to each Word, obtaining a second vector (Parent Emb) according to dependent words of each Word, and obtaining a third vector (relationship Emb) according to the Relation between each Word and the dependent words.
In one embodiment of the present application, the first vector, the second vector and the third vector may be merged together by the merging module, and each word embedding vector is M-dimensional, so that one N × 3M vector is obtained.
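The merge into an N × 3M result can be illustrated as below. This is a sketch under stated assumptions: the embedding tables are random stand-ins for trained parameters, row 0 of each table is kept at zero to represent padding, and `make_table` and `relation_vectors` are hypothetical helpers rather than parts of the described system.

```python
import random


def make_table(rows, dim, seed):
    """Random embedding table standing in for trained parameters."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(rows)]
    table[0] = [0.0] * dim  # row 0 is reserved for zero padding
    return table


def relation_vectors(word_ids, parent_ids, rel_ids, word_emb, parent_emb, rel_emb):
    """Concatenate the three M-dimensional embeddings at every position."""
    return [word_emb[w] + parent_emb[p] + rel_emb[r]
            for w, p, r in zip(word_ids, parent_ids, rel_ids)]


N, M = 5, 4  # standard length and embedding size (illustrative values)
word_emb = make_table(10, M, seed=0)
parent_emb = make_table(10, M, seed=1)
rel_emb = make_table(6, M, seed=2)

# Three real positions followed by two padded positions.
vectors = relation_vectors(
    [2, 3, 4, 0, 0],
    [3, 3, 3, 0, 0],
    [2, 1, 3, 0, 0],
    word_emb, parent_emb, rel_emb,
)
```

The result has N rows of 3M values each, and padded positions stay all-zero because every table maps id 0 to a zero row.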
In an embodiment of the present application, the process of processing the medical information text by applying the text punctuation adding method of the present application may further include: using a conditional random field that takes the vectors extracted by the neural network as input, calculating the probability of every predicted path with the Viterbi algorithm, and selecting the path with the maximum probability value as the punctuation prediction result. Suppose that in a certain scene there are three predicted punctuation categories: no punctuation, comma, and period. For the prediction text [ small doctor is plumes ], 5 positions in total need to be predicted and each position has 3 possible cases, so there are 3^5 = 243 possible prediction results in total. If [ no punctuation, no punctuation, no punctuation, no punctuation, period ] is the result with the maximum probability value, the final prediction result is [ small doctor is plumes. ]
The text punctuation adding method of the present application, when used to process medical information text, performs Chinese punctuation prediction based on the syntactic dependency tree and the attention mechanism. It uses the feature extraction capability of the LSTM in the neural network and the modeling capability of the conditional random field on the output sequence, and by making use of the syntactic dependency tree and the attention mechanism it can fully consider the connections between words and mine the semantic relation information in the words as far as possible. The whole sentence is treated as a whole and the rationality of the overall prediction is considered, and in practical use the effect is better than that of existing models. The method and the device can automatically add punctuation marks to the text, supplement necessary sentence structure information, and greatly improve the effect of subsequent natural language processing tasks.
Embodiments of the apparatus of the present application are described below, which may be used to implement the text punctuation adding method of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the text punctuation adding method described above in the present application.
FIG. 4 schematically shows a block diagram of a text punctuation adding apparatus according to an embodiment of the present application.
Referring to fig. 4, a text punctuation adding apparatus 400 according to an embodiment of the present application includes an obtaining module 401, a determining module 402, and an adding module 403.
In some embodiments of the application, based on the foregoing scheme, the obtaining module 401 is configured to obtain a text to be added, and perform word segmentation on the text to be added to obtain a plurality of words; obtaining the relation between each word in the plurality of words, and obtaining the dependent words of each word and the relation between each word and the dependent words; the determining module 402 is configured to determine a relationship vector of each word based on each word, a dependent word of each word, and a relationship between each word and its dependent word; the adding module 403 is configured to obtain a relationship between relationship vectors of the plurality of words, and add punctuation between the plurality of words based on the relationship between the relationship vectors.
In some embodiments of the present application, based on the foregoing solution, the obtaining module 401 is configured to: performing word segmentation on the text to be added according to the text sequence to obtain a first word segmentation result; segmenting the text to be added according to the reverse order of the text to obtain a second segmentation result; acquiring the difference between the first segmentation result and the second segmentation result, and segmenting the text to be added corresponding to the difference from the middle to two sides to obtain a difference result; and replacing the difference between the first segmentation result and the second segmentation result with a difference result, and taking the replaced first segmentation result as a plurality of words.
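A minimal sketch of the two-pass segmentation follows. The text does not name the segmentation algorithm, so greedy maximum matching against a toy dictionary is an assumption, as are the function names, the dictionary, and the sample sentence. The sketch produces the first (text order) and second (reverse order) segmentation results and locates the character ranges where they differ; the re-segmentation of those ranges "from the middle to two sides" is not implemented here.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning in text order."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning in reverse text order."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words

def to_spans(words):
    """Convert a word list to a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans

def disputed_regions(first, second):
    """Character ranges where the two segmentations disagree, merged into runs."""
    diff = sorted(to_spans(first) ^ to_spans(second))
    regions = []
    for start, end in diff:
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions

# Toy dictionary and sentence (hypothetical)
vocab = {"研究", "研究生", "生命", "起源"}
text = "研究生命的起源"

first_result = forward_max_match(text, vocab, max_len=3)
second_result = backward_max_match(text, vocab, max_len=3)
regions = disputed_regions(first_result, second_result)
```

Here the two passes disagree only on the first four characters, so only that range would need to be segmented again before it replaces the corresponding part of the first result.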
In some embodiments of the present application, based on the foregoing solution, the obtaining module 401 is configured to: acquiring the part of speech and the position of each word; and determining the relation among the words according to the parts of speech and the positions of the words.
In some embodiments of the present application, based on the foregoing scheme, the determining module 402 is configured to: acquiring a first vector obtained based on each word; acquiring a second vector obtained based on the dependent words of each word; obtaining a third vector obtained based on the relation between each word and the dependent word thereof; and combining the first vector, the second vector and the third vector to obtain a corresponding relation vector of each word.
In some embodiments of the present application, based on the foregoing scheme, the determining module 402 is configured to: coding each word to obtain a first sequence; coding the dependent words of each word to obtain a second sequence; coding the relation between each word and the dependent word thereof to obtain a third sequence; and truncating or zero padding the first sequence, the second sequence and the third sequence, mapping the truncated or zero-padded first sequence into a first vector, mapping the truncated or zero-padded second sequence into a second vector, and mapping the truncated or zero-padded third sequence into a third vector.
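The encoding, truncation or zero padding, mapping, and combining steps can be sketched as follows. Several details are assumptions not stated in the text: the fixed length `MAX_LEN`, the toy vocabularies and relation labels, the randomly initialised lookup tables standing in for learned embeddings, the use of the head word as each word's dependent word, and concatenation as the way the three vectors are combined.

```python
import random

MAX_LEN = 5  # hypothetical fixed sequence length

def encode(tokens, vocab):
    """Map tokens to integer ids."""
    return [vocab[t] for t in tokens]

def pad_or_truncate(ids, length=MAX_LEN, pad_id=0):
    """Cut the sequence to `length`, or fill it with zeros up to `length`."""
    return (ids + [pad_id] * length)[:length]

def make_table(size, dim, seed):
    """Random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(size)]
    table[0] = [0.0] * dim  # the padding id maps to the zero vector
    return table

def embed(ids, table):
    """Map each id in the sequence to its vector."""
    return [table[i] for i in ids]

# Toy vocabularies (hypothetical)
word_vocab = {"<pad>": 0, "<root>": 1, "doctor": 2, "examines": 3, "patient": 4}
rel_vocab = {"<pad>": 0, "root": 1, "nsubj": 2, "obj": 3}

# Hand-written dependency analysis of a three-word sentence
words = ["doctor", "examines", "patient"]
heads = ["examines", "<root>", "examines"]   # dependent word of each word
relations = ["nsubj", "root", "obj"]         # relation to that dependent word

first_seq = pad_or_truncate(encode(words, word_vocab))
second_seq = pad_or_truncate(encode(heads, word_vocab))
third_seq = pad_or_truncate(encode(relations, rel_vocab))

word_table = make_table(len(word_vocab), dim=4, seed=0)
rel_table = make_table(len(rel_vocab), dim=2, seed=1)

first_vec = embed(first_seq, word_table)
second_vec = embed(second_seq, word_table)
third_vec = embed(third_seq, rel_table)

# Combine by concatenation: one relation vector per position
relation_vectors = [w + h + r for w, h, r in zip(first_vec, second_vec, third_vec)]
```

Each position ends up with a relation vector of dimension 4 + 4 + 2 = 10, and padded positions are all zeros, so sentences of different lengths produce inputs of the same shape.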
In some embodiments of the present application, based on the foregoing scheme, the adding module 403 is configured to: input the relation vectors of the plurality of words into a pre-trained attention model to obtain the relations between the relation vectors of the plurality of words.
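The text does not specify the form of the attention model. As one possible illustration, the sketch below uses unparameterised scaled dot-product self-attention, in which every relation vector is compared with every other one and the resulting weights express how strongly the positions relate. This is an assumption: a pre-trained model would normally also apply learned query, key, and value projections, which are omitted here, and the input vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Returns (weights, outputs): weights[i][j] is how much position i attends
    to position j, and outputs[i] is the weighted sum of all input vectors.
    """
    dim = len(vectors[0])
    weights = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Made-up relation vectors for three words
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(vectors)
```

In this toy input the first two vectors point in nearly the same direction, so the first position attends more to the second position than to the third.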
In some embodiments of the present application, based on the foregoing scheme, the adding module 403 is configured to: add punctuation among the plurality of words in different ways to obtain a plurality of adding modes; extract features of the relations between the relation vectors through a bidirectional LSTM layer; and, based on the features, calculate the probability of each adding mode using the Viterbi algorithm and add punctuation among the plurality of words according to the adding mode with the highest probability.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 50 according to this embodiment of the present application is described below with reference to fig. 5. The electronic device 50 shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the electronic device 50 is embodied in the form of a general purpose computing device. The components of the electronic device 50 may include, but are not limited to: at least one processing unit 51, at least one storage unit 52, a bus 53 connecting different system components (including the storage unit 52 and the processing unit 51), and a display unit 54.
The storage unit 52 stores program code that can be executed by the processing unit 51, causing the processing unit 51 to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above in this specification.
The storage unit 52 may include readable media in the form of volatile storage units, such as a random access memory (RAM) unit 521 and/or a cache memory unit 522, and may further include a read-only memory (ROM) unit 523.
The storage unit 52 may also include a program/utility 524 having a set (at least one) of program modules 525, such program modules 525 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 53 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 50 may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 50, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 50 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 55. Also, the electronic device 50 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 56. As shown, the network adapter 56 communicates with other modules of the electronic device 50 over the bus 53. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 50, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
There is also provided, in accordance with an embodiment of the present application, a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 6, a program product 60 for implementing the above method according to an embodiment of the present application is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for adding punctuation to text, comprising:
acquiring a text to be added, and segmenting the text to be added to obtain a plurality of words;
obtaining the relation between each word in the plurality of words to obtain the dependent words of each word and the relation between each word and the dependent words;
determining a relation vector of each word based on each word, a dependent word of each word and a relation between each word and the dependent word thereof;
obtaining the relation among the relation vectors of the words;
adding punctuation between the plurality of words based on the relations between the relation vectors.
2. The method for adding punctuation to a text according to claim 1, wherein the segmenting the text to be added into words to obtain a plurality of words comprises:
performing word segmentation on the text to be added according to the text sequence to obtain a first word segmentation result;
segmenting the text to be added according to the reverse order of the text to obtain a second segmentation result;
obtaining the difference between the first segmentation result and the second segmentation result, and segmenting the text to be added corresponding to the difference from the middle to two sides to obtain a difference result;
and replacing the difference between the first segmentation result and the second segmentation result with the difference result, and taking the replaced first segmentation result as the plurality of words.
3. The method for adding punctuation to text according to claim 1, wherein said obtaining the relationship between said plurality of words comprises:
acquiring the part of speech and the position of each word;
and determining the relation among the words according to the parts of speech and the positions of the words.
4. The method of adding punctuation to text according to claim 1, wherein the determining a relationship vector corresponding to each word based on each word, a dependent word of each word, and a relationship between each word and its dependent word comprises:
acquiring a first vector obtained based on each word;
obtaining a second vector obtained based on the dependent words of each word;
obtaining a third vector obtained based on the relation between each word and the dependent word thereof;
and combining the first vector, the second vector and the third vector to obtain a relation vector corresponding to each word.
5. The method for adding punctuation to text according to claim 4, characterized in that said obtaining a first vector obtained based on said each word; obtaining a second vector obtained based on the dependent words of each word; obtaining a third vector obtained based on the relationship between each word and its dependent word, including:
coding each word to obtain a first sequence;
encoding the dependent words of each word to obtain a second sequence;
coding the relation between each word and the dependent word thereof to obtain a third sequence;
truncating or zero-filling the first sequence, the second sequence and the third sequence, mapping the truncated or zero-filled first sequence to the first vector, mapping the truncated or zero-filled second sequence to the second vector, and mapping the truncated or zero-filled third sequence to the third vector.
6. The method for adding punctuation to text according to claim 1, wherein said obtaining the relationship between the relationship vectors of the plurality of words comprises:
and inputting the relation vectors of the plurality of words into a pre-trained attention model to obtain the relation between the relation vectors of the plurality of words.
7. The method of claim 1, wherein adding punctuation between the plurality of words based on the relationship between the relationship vectors comprises:
adding punctuation among the plurality of words to obtain a plurality of adding modes;
extracting the features of the relationship between the relationship vectors through a bidirectional LSTM layer;
based on the characteristics, calculating the probability of various adding modes by utilizing a Viterbi algorithm, and adding punctuation among the words based on the adding mode with the highest probability in the multiple modes.
8. An apparatus for adding punctuation to text, comprising:
the acquisition module is configured to acquire a text to be added, and perform word segmentation on the text to be added to obtain a plurality of words; obtaining the relation between each word in the plurality of words to obtain the dependent words of each word and the relation between each word and the dependent words;
a determining module configured to determine a relationship vector of each word based on each word, a dependent word of each word, and a relationship between each word and its dependent word;
and the adding module is configured to acquire the relation among the relation vectors of the plurality of words and add punctuation among the plurality of words based on the relation among the relation vectors.
9. A computer readable program medium having computer program instructions stored thereon, comprising:
the computer program instructions, when executed by a computer, cause the computer to perform the method of any of claims 1-7 above.
10. An electronic device, comprising:
a processor; and
a memory having computer-readable instructions stored thereon; wherein the computer readable instructions, when executed by the processor, implement the method of any of claims 1-7 above.
CN202011344671.1A 2020-11-25 2020-11-25 Method, device, medium and electronic equipment for adding punctuation to text Pending CN112464642A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011344671.1A CN112464642A (en) 2020-11-25 2020-11-25 Method, device, medium and electronic equipment for adding punctuation to text
PCT/CN2021/084169 WO2021213155A1 (en) 2020-11-25 2021-03-30 Method, apparatus, medium, and electronic device for adding punctuation to text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011344671.1A CN112464642A (en) 2020-11-25 2020-11-25 Method, device, medium and electronic equipment for adding punctuation to text

Publications (1)

Publication Number Publication Date
CN112464642A true CN112464642A (en) 2021-03-09

Family

ID=74807954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011344671.1A Pending CN112464642A (en) 2020-11-25 2020-11-25 Method, device, medium and electronic equipment for adding punctuation to text

Country Status (2)

Country Link
CN (1) CN112464642A (en)
WO (1) WO2021213155A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021213155A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Method, apparatus, medium, and electronic device for adding punctuation to text
CN117113941A (en) * 2023-10-23 2023-11-24 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629237B (en) * 2023-07-25 2023-10-10 江西财经大学 Event representation learning method and system based on gradually integrated multilayer attention

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6589704B2 (en) * 2016-03-17 2019-10-16 日本電気株式会社 Sentence boundary estimation apparatus, method and program
CN107153687B (en) * 2017-04-18 2021-01-05 东北大学 Indexing method for social network text data
CN109062902B (en) * 2018-08-17 2022-12-06 科大讯飞股份有限公司 Text semantic expression method and device
CN109614627B (en) * 2019-01-04 2023-01-20 平安科技(深圳)有限公司 Text punctuation prediction method and device, computer equipment and storage medium
CN110032732A (en) * 2019-03-12 2019-07-19 平安科技(深圳)有限公司 A kind of text punctuate prediction technique, device, computer equipment and storage medium
CN111027291B (en) * 2019-11-27 2024-03-26 达观数据有限公司 Method and device for adding mark symbols in text and method and device for training model, and electronic equipment
CN111414745A (en) * 2020-04-03 2020-07-14 龙马智芯(珠海横琴)科技有限公司 Text punctuation determination method and device, storage medium and electronic equipment
CN112464642A (en) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 Method, device, medium and electronic equipment for adding punctuation to text

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021213155A1 (en) * 2020-11-25 2021-10-28 平安科技(深圳)有限公司 Method, apparatus, medium, and electronic device for adding punctuation to text
CN117113941A (en) * 2023-10-23 2023-11-24 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium
CN117113941B (en) * 2023-10-23 2024-02-06 新声科技(深圳)有限公司 Punctuation mark recovery method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2021213155A1 (en) 2021-10-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination