WO2022111347A1 - Information processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2022111347A1
WO2022111347A1 PCT/CN2021/131092 CN2021131092W WO2022111347A1 WO 2022111347 A1 WO2022111347 A1 WO 2022111347A1 CN 2021131092 W CN2021131092 W CN 2021131092W WO 2022111347 A1 WO2022111347 A1 WO 2022111347A1
Authority
WO
WIPO (PCT)
Prior art keywords: character, probability, predicted, target, starting
Application number
PCT/CN2021/131092
Other languages: English (en), French (fr)
Inventors: 王岩, 柴琛林, 张新松, 李航
Original Assignee: Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Application filed by Beijing ByteDance Network Technology Co., Ltd. (北京字节跳动网络技术有限公司)
Publication of WO2022111347A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/194 — Text processing; calculation of difference between files
    • G06F 16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/216 — Parsing using statistical methods
    • G06F 40/30 — Semantic analysis

Definitions

  • the present disclosure relates to text processing technology, for example, to an information processing method, apparatus, electronic device and storage medium.
  • In the related technology, after the text information to be identified is acquired, whether the text information is valid is judged according to the similarity between texts.
  • If the similarity is low, the text information to be identified is judged to be invalid information and is deleted as a whole.
  • The present disclosure provides an information processing method, apparatus, electronic device and storage medium, so as to realize the extraction of valid information from text information to be recognized.
  • An information processing method, including: acquiring text information to be recognized; acquiring, for each of multiple characters in the text information, a predicted start probability as the extraction starting point and a predicted end probability as the extraction end point; determining a target start character and a target end character from those probabilities; and extracting valid information from the text information according to the target start character and the target end character.
  • an information processing device comprising:
  • a predicted probability acquisition module, configured to acquire the text information to be recognized, and to obtain, for each of the multiple characters in the text information, a predicted start probability as the extraction starting point and a predicted end probability as the extraction end point;
  • a target character acquisition module configured to determine a target start character according to the predicted start probability of the plurality of characters, and to determine the target end character according to the predicted end probability of the plurality of characters;
  • the first valid information acquisition module is configured to extract valid information in the text information according to the target start character and the target end character.
  • An electronic device including a memory, a processing device, and a computer program stored in the memory and executable on the processing device; when the processing device executes the computer program, the information processing method of any embodiment of the present disclosure is implemented.
  • FIG. 1 is a flowchart of an embodiment of an information processing method of the present disclosure.
  • FIG. 2 is a flowchart of another embodiment of an information processing method of the present disclosure.
  • FIG. 3 is a flowchart of another embodiment of an information processing method of the present disclosure.
  • FIG. 4 is a structural block diagram of an embodiment of an information processing apparatus of the present disclosure.
  • FIG. 5 is a structural block diagram of an electronic device suitable for implementing an embodiment of the present disclosure.
  • method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of an information processing method provided in Embodiment 1 of the present disclosure. This embodiment is applicable to extracting valid information from text information.
  • The method can be executed by the information processing apparatus of this embodiment of the present disclosure, which can be implemented by software and/or hardware and integrated into a terminal device or server. The method includes the following steps.
  • The source of the text information to be identified is not limited. Because text information comes from many sources, the acquired text may contain useless content. For example, when a network user's remarks on an event are obtained, greetings are often added to the reply content because of the website's fixed display format or the user's personal speaking habits. It is therefore necessary to extract the required valid content from the text information to be recognized; for example, the text information to be recognized is "Hello, this phenomenon occurs.
  • The start probability of each character in the text information to be recognized as the extraction starting point, and its end probability as the extraction end point, can be predicted from the starting and ending characters of the multiple pieces of valid information stored in a database. For example, the starting characters of all valid information in the database are counted and the occurrence probability of each starting character is calculated; that probability is assigned to the same character in the text information to be recognized, while all other characters receive a probability of zero.
  • In this way, the probability of each character in the text information to be recognized being a starting character is obtained. Likewise, the ending characters of all valid information in the database are counted, the occurrence probability of each ending character is calculated and assigned to the same character in the text information to be recognized (other characters receiving zero), and the probability of each character being an ending character is obtained accordingly.
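The database-statistics variant above can be sketched as follows. This is a simplified illustration; the function and variable names are assumptions, not the patent's:

```python
from collections import Counter

def char_probabilities(valid_texts, text):
    """For each character of `text`, estimate the probability of being a
    span start or span end from the first/last characters of known-valid
    strings stored in a database."""
    starts = Counter(t[0] for t in valid_texts if t)
    ends = Counter(t[-1] for t in valid_texts if t)
    n = len(valid_texts)
    # Characters never observed as a start/end get probability zero.
    start_p = [starts.get(c, 0) / n for c in text]
    end_p = [ends.get(c, 0) / n for c in text]
    return start_p, end_p
```

For instance, if every valid string in the database begins with "a", then every "a" in the text to be recognized receives start probability 1.0 and all other characters 0.0.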
  • In an embodiment, obtaining the predicted start and end probabilities includes: obtaining, through a trained sequence extraction model, the predicted start probability of each character in the text information as the extraction starting point and the predicted end probability of each character as the extraction end point.
  • The sequence extraction model is a pre-trained model used to extract text features from the input text information and obtain feature vectors. Text features are the basic units representing text content; a word or phrase in the text information can serve as a text feature.
  • The feature vector is the quantitative representation of the text features, usually multi-dimensional. After the feature vector of the text information to be identified is obtained, the model identifies it and outputs the predicted start probability of each character as the extraction starting point and the predicted end probability of each character as the extraction end point.
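As a sketch of the model's output stage described above, the per-character scores produced by an encoder can be turned into start/end probability distributions with a softmax over positions. The scores here are stand-ins for the encoder's real outputs; names are illustrative:

```python
import math

def span_probabilities(start_scores, end_scores):
    """Turn per-character start/end scores (e.g. from a BERT-style encoder)
    into predicted start/end probability distributions via softmax."""
    def softmax(xs):
        m = max(xs)                      # subtract max for stability
        exps = [math.exp(x - m) for x in xs]
        total = sum(exps)
        return [e / total for e in exps]
    return softmax(start_scores), softmax(end_scores)
```

Each distribution sums to 1 over the character positions, so the highest-probability position is a direct candidate for the extraction start or end point.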
  • For example, the text information to be recognized includes 50 characters (including text and symbols), numbered 1 to 50 in character order, and the valid content lies between character No. 5 and character No. 30.
  • Suppose the predicted probability of the text between character No. 5 and character No. 30 is 40%, the predicted probability of the text between character No. 8 and character No. 30 is 30%, and the predicted probability of the text between character No. 5 and character No. 20 is 30%; character No. 5 can therefore be determined as the extraction starting point.
  • the method further includes: acquiring a sequence sample set, and performing sequence extraction training on the initial sequence extraction model by using the sequence sample set, so as to obtain a trained sequence extraction model.
  • The sequence sample set includes multiple sequence samples, and each sequence sample is a mapping pair composed of original text information and corresponding valid text information. For example, in one sequence sample, the original text information is "Hello, I will answer you! Wind is a natural phenomenon caused by air flow, and it is caused by solar radiation heat."
  • The valid text information corresponding to this sequence sample is "Wind is a natural phenomenon caused by air flow, and it is caused by solar radiation heat." Using the original text information of the sequence samples as input and the valid text information as output, the initial sequence extraction model is trained for semantic understanding, and the trained sequence extraction model is finally obtained.
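A training pair like the one above can be converted into span labels for the extraction model by locating the valid text inside the original text. This indexing step is an assumed implementation detail, not spelled out in the patent:

```python
def make_labels(original, valid):
    """Return the (start, end) character indices of `valid` inside
    `original`, or None if the pair is inconsistent. These indices are
    the supervision targets for the start/end probability heads."""
    start = original.find(valid)
    if start == -1:
        return None
    return start, start + len(valid) - 1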
  • the initial sequence extraction model is constructed based on a self-attention mechanism.
  • The self-attention mechanism is an improvement on the attention mechanism: it can quickly extract important features of sparse data while reducing dependence on external information, making it better suited to capturing the internal correlations of the data or features.
  • The initial sequence extraction model may include a Bidirectional Encoder Representations from Transformers (BERT) model or a RoBERTa model. The BERT model is a pre-trained (Pre-Train) language model; after training on a large amount of unlabeled corpus, it yields a semantic representation model containing the rich semantic information of the text information to be recognized.
  • Because the BERT model itself has strong language understanding ability, it only needs fine-tuning. Using BERT as the initial sequence extraction model therefore reduces the model's dependence on the number of sequence samples in the sequence sample set and lowers the training difficulty. The RoBERTa model is another semantic representation model, obtained on the basis of BERT by improving the training tasks and the data generation method.
  • Among the predicted starting probabilities, the character with the largest probability value is used as the target starting character; that is, the character with the largest starting probability is most likely to be the starting point of the valid information to be extracted.
  • Among the predicted ending probabilities, the character with the largest probability value is used as the target ending character; that is, the character with the largest ending probability is most likely to be the end point of the valid information to be extracted.
  • In the above example, the 5th character and the 30th character are used as the target start character and the target end character, respectively.
  • In an embodiment, determining the target starting character according to the predicted starting probabilities of the multiple characters includes: acquiring the first predicted starting probability with the highest value among the multiple predicted starting probabilities, and judging whether the first character corresponding to it is a text character; if the first character is a text character, it is used as the target starting character; otherwise, the second predicted starting probability with the highest value among the remaining predicted starting probabilities is acquired, and whether its corresponding second character is a text character is judged, and so on, until the target character corresponding to the highest remaining predicted starting probability is a text character, at which point that target character is used as the target starting character.
  • A complete piece of valid information must start with a text character. Therefore, if the character corresponding to the highest of the multiple predicted starting probabilities is a text character, that character is used as the extraction starting point; if it is not (for example, it is a punctuation mark), the character corresponding to the next-highest predicted starting probability is examined, and so on, until the character corresponding to the highest remaining probability is a text character. That character is used as the extraction starting point, which avoids using meaningless characters such as punctuation marks as the starting point and improves the extraction accuracy of valid information.
  • Determining the target ending character according to the predicted ending probabilities of the multiple characters includes: acquiring the first predicted ending probability with the highest value and judging whether the corresponding third character is a text character; if it is, the third character is used as the target ending character; otherwise, the second predicted ending probability with the highest value among the remaining predicted ending probabilities is acquired, and whether its corresponding fourth character is a text character is judged, and so on, until the target character corresponding to the highest remaining predicted ending probability is a text character, at which point that target character is used as the target ending character.
  • In another embodiment, the method further includes: taking the text character that is located after the first character and closest to it as the target starting character. If the character corresponding to the highest predicted starting probability is not a text character (for example, a punctuation mark), the sequence extraction model may have made a sentence segmentation error. In that case, by the principle of proximity, the text character that follows the first character most closely in character order is used as the target starting character, ensuring that the starting character is found near the highest predicted starting probability and avoiding extraction errors caused by incorrect sentence segmentation.
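The two selection strategies above — skipping non-text candidates in probability order, and falling back to the nearest following text character — might be sketched as follows. The names are illustrative, and using `\w` as the "text character" test is an assumption:

```python
import re

def _is_word(ch):
    """Treat Unicode word characters (letters, digits, CJK) as 'text'."""
    return bool(re.match(r"\w", ch))

def pick_start_skip(chars, probs):
    """Strategy 1: walk candidates in descending start-probability order
    until one is a text character (skips punctuation)."""
    for i in sorted(range(len(chars)), key=lambda i: -probs[i]):
        if _is_word(chars[i]):
            return i
    return None

def pick_start_nearest(chars, probs):
    """Strategy 2 (proximity rule): if the top candidate is not a text
    character, take the nearest text character *after* it."""
    top = max(range(len(chars)), key=lambda i: probs[i])
    if _is_word(chars[top]):
        return top
    for i in range(top + 1, len(chars)):
        if _is_word(chars[i]):
            return i
    return None
```

The two strategies can disagree: strategy 1 may jump to a distant high-probability character, while strategy 2 stays local to the original top candidate.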
  • In an embodiment, determining the target start character and the target end character includes: if the first predicted starting probability with the highest value corresponds to multiple candidate starting characters, selecting, among them, the candidate whose character order is earliest as the target starting character; and/or, if the first predicted ending probability with the highest value corresponds to multiple candidate ending characters, selecting, among them, the candidate whose character order is latest as the target ending character.
  • That is, the multiple candidate starting characters are arranged in character order and the earliest one is taken, so that the extraction range is as wide as possible and no valid information is lost; similarly, if multiple characters share the highest predicted ending probability, the latest of them in character order is taken as the target ending character.
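The tie-breaking rule above (earliest tied start, latest tied end, so the span is widened rather than narrowed) can be sketched directly:

```python
def tie_break_span(start_probs, end_probs):
    """When several characters share the highest start (or end)
    probability, widen the span: earliest tied start, latest tied end."""
    smax, emax = max(start_probs), max(end_probs)
    start = min(i for i, p in enumerate(start_probs) if p == smax)
    end = max(i for i, p in enumerate(end_probs) if p == emax)
    return start, end
```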
  • In addition, the sequence extraction model or other methods can be used for a second extraction: after the valid information is extracted, it is extracted again to obtain more accurate valid information.
  • The related technical solution of judging validity by inter-text similarity can only handle text information matching pre-obtained template information: once the similarity between the text information and the template is found to be low, the text can only be deleted, the valid information in it cannot be extracted, and extraction accuracy is low. In the technical solution of the present disclosure, by contrast, once the target start character and target end character are determined, the valid information in the text information is extracted accurately, and the extraction capability also covers text information from unknown fields or without a matching template.
  • In the technical solution of this embodiment of the present disclosure, the predicted start probability of each character in the text information to be recognized as the extraction starting point and the predicted end probability of each character as the extraction end point are obtained, and after the target start character and target end character are determined, the valid information in the text information is extracted. This achieves accurate extraction of the valid information in the text and avoids deleting the whole text when the text information to be identified contains invalid information. Text information from unknown fields or without a matching template can also be extracted, which expands the application scope of text extraction technology and improves the extraction accuracy of valid information.
  • FIG. 2 is a flowchart of an information processing method provided in Embodiment 2 of the present disclosure, described on the basis of the above embodiment. In this embodiment, after the text information to be recognized is acquired, the text information is first classified. The method of this embodiment includes the following steps.
  • After the text information to be identified is acquired, it can be compared for similarity with the multiple pieces of valid information stored in the database, and the highest similarity percentage among them obtained. Since none of the stored valid information contains invalid text, the higher the similarity percentage between the text information to be identified and the stored valid information, the lower the classification probability that the text information contains invalid text.
  • Subtracting the similarity percentage from the value 1 then gives the classification probability that the text information contains invalid text.
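A minimal sketch of this similarity-based classification probability, using `difflib.SequenceMatcher` as an illustrative similarity measure (the patent does not fix a particular one):

```python
from difflib import SequenceMatcher

def invalid_probability(text, valid_db):
    """Compare `text` against each stored valid string, take the highest
    similarity ratio, and treat 1 - ratio as the probability that the
    text contains invalid content."""
    best = max((SequenceMatcher(None, text, v).ratio() for v in valid_db),
               default=0.0)
    return 1.0 - best
```

A text identical to a stored valid string scores 0.0 (certainly valid); a text with no overlap at all scores 1.0.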
  • In an embodiment, obtaining the classification probability that the text information contains invalid text includes: obtaining that classification probability through a trained semantic classification model.
  • The semantic classification model is a pre-trained classification model whose function is to output, from text features extracted from the input text information to be recognized, the classification category of the text information and the corresponding classification probability. The classification categories are text containing invalid information and text not containing invalid information; the classification probability reflects the possibility that the text information contains invalid information, with a larger value meaning the text content is more likely to contain invalid information.
  • In an embodiment, before the classification probability is obtained through the trained semantic classification model, the method further includes: acquiring a semantic sample set, and performing semantic classification training on the initial semantic classification model with it to obtain the trained semantic classification model, wherein the initial semantic classification model is constructed based on a neural network.
  • The semantic sample set includes multiple positive semantic samples and multiple negative semantic samples. A positive semantic sample is an invalid-information sample, that is, all text content in it is invalid information; for example, "Hello, I hope my answer is helpful to you" and "Sorry, I didn't find the answer." are two positive semantic samples.
  • The label information of a positive semantic sample is 1, that is, when the initial semantic classification model is trained, its output result is marked as 1.
  • A negative semantic sample is a valid-information sample, that is, all text content in it is valid information; for example, "the density of water is greater than the density of ice" and "earthquakes are vibrations caused by the rapid release of energy from the crust, during which seismic waves will be generated; they are a natural phenomenon" are two negative semantic samples, and the label information of a negative semantic sample is 0, that is, when the initial semantic classification model is trained, the output result is marked as 0.
  • Training the initial semantic classification model on positive and negative semantic samples gives the trained semantic classification model the ability to output, for input text information, its classification category and the corresponding classification probability, where the classification probability is a value greater than or equal to 0 and less than or equal to 1.
  • The larger the value, the closer the text information is to the positive semantic samples and the greater the probability that it contains invalid information; the smaller the value, the closer it is to the negative semantic samples and the smaller the probability that it contains invalid information.
  • If the classification probability is greater than or equal to a preset classification probability threshold, obtain the predicted start probability of each character in the text information as the extraction starting point, and the predicted end probability of each character as the extraction end point.
  • The preset classification probability threshold can be set as required. Since the output classification categories in the above technical solution include only two types, text containing invalid information and text not containing invalid information, the threshold can be set to 0.5 for binary classification: when the classification category of the text information is text containing invalid information (classification probability greater than or equal to 0.5), the text information is input into the sequence extraction model for extraction; when the classification category is text not containing invalid information (classification probability less than 0.5), the text information itself is regarded as valid information.
  • Alternatively, the preset classification probability threshold can be set to a small value, for example 0.05: only when the classification probability of the text information is less than 0.05 is it classified as not containing invalid text. Whenever the classification probability is greater than or equal to 0.05, the text information must be extracted by the trained sequence extraction model. This ensures the classification accuracy of text deemed free of invalid information and avoids invalid information remaining in text so classified.
  • If the classification probability is less than the preset classification probability threshold, the text information is judged not to contain invalid text; that is, all text content in it is valid information, and the text information as a whole is regarded as valid information.
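The threshold dispatch described above might look like the following, with `classify` and `extract` standing in for the trained semantic classification and sequence extraction models (illustrative names, not the patent's):

```python
def process(text, classify, extract, threshold=0.5):
    """Texts classified as containing invalid content (probability >=
    threshold) go through span extraction; others are returned whole."""
    if classify(text) >= threshold:
        return extract(text)
    return text
```

Lowering `threshold` (e.g. to 0.05) sends more texts through extraction, trading extra work for fewer missed invalid passages.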
  • In this technical solution, the classification probability that the text information contains irrelevant content is obtained, and valid information is extracted only when the classification probability reaches the preset classification probability threshold. This effectively judges whether the text information contains irrelevant content, so that extraction is performed only on text that does, improving the efficiency of extracting valid information from text information.
  • FIG. 3 is a flowchart of an information processing method provided in Embodiment 3 of the present disclosure. This embodiment is described based on the above-mentioned embodiment.
  • the text information to be recognized includes a plurality of valid information paragraphs.
  • the method of this embodiment includes the following steps.
  • S310 Acquire the text information to be recognized, and acquire the predicted start probability of each of the multiple characters in the text information as the extraction starting point, and the predicted end probability of each of the multiple characters as the extraction end point.
  • The text information to be identified may include multiple pieces of valid information. For example, the text information to be identified is "Hello! Acid rain refers to rain, snow or other forms of precipitation with a pH of less than 5.6, mainly caused by man-made discharge of a large amount of acidic substances into the atmosphere; and an earthquake is a natural phenomenon of vibration caused by the rapid release of energy from the earth's crust, during which seismic waves are generated."
  • This text information includes two valid information paragraphs, namely "Acid rain refers to rain, snow or other forms of precipitation with a pH of less than 5.6, mainly caused by man-made discharge of large amounts of acidic substances into the atmosphere" and "An earthquake is a natural phenomenon of vibration caused by the rapid release of energy from the earth's crust, during which seismic waves are generated", which explain the two natural phenomena of "acid rain" and "earthquake" respectively.
  • Accordingly, among the multiple predicted starting probabilities, a preset number with the highest values are obtained, and among the multiple predicted ending probabilities, a preset number with the highest values are obtained, so as to locate the multiple valid information paragraphs and obtain the characters corresponding to the multiple probability values.
  • S330 Determine the character order of the first character, the second character, the third character and the fourth character.
  • Character order is the order in which the multiple characters are arranged in the text information to be recognized.
  • For example, the text information to be recognized includes 50 characters (including text and symbols), numbered from No. 1 to No. 50 in character order.
  • Acid rain refers to rain, snow or other forms of precipitation with a pH of less than 5.6, mainly caused by man-made discharge of a large amount of acidic substances into the atmosphere; and an earthquake is a natural phenomenon of vibration caused by the rapid release of energy from the earth's crust, during which seismic waves are generated.
  • The fourth character is "象" ("xiang", from "phenomenon"), the final character of the text "a natural phenomenon during which seismic waves are generated".
  • Since the character arrangement order of the text information to be recognized conforms to the above rule, "酸" ("acid") and "地" ("earth") are used as the first and second target start characters respectively, and "的" ("de") and "象" ("xiang") are used as the first and second target end characters respectively.
  • The text information between "酸" and "的" is extracted as the first valid information, that is, "Acid rain refers to rain, snow or other forms of precipitation with pH less than 5.6, mainly caused by man-made discharge of a large amount of acidic substances into the atmosphere".
  • The text information between "地" and "象" is extracted as the second valid information, that is, "An earthquake is a natural phenomenon of vibration caused by the rapid release of energy from the earth's crust, during which seismic waves are generated".
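The multi-paragraph case above can be sketched by taking the k highest start and end probabilities and pairing them in character order. This assumes the spans are ordered and non-overlapping, as in the example; `k` stands in for the preset number of paragraphs:

```python
def extract_spans(text, start_probs, end_probs, k=2):
    """Locate k valid spans: pick the k highest-probability start and end
    positions, sort each set by character order, and pair them up."""
    def top_k_positions(probs):
        # k highest-probability positions, returned in character order
        best = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
        return sorted(best)
    starts = top_k_positions(start_probs)
    ends = top_k_positions(end_probs)
    # pair i-th start with i-th end (spans assumed ordered, non-overlapping)
    return [text[s:e + 1] for s, e in zip(starts, ends)]
```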
  • FIG. 4 is a structural block diagram of an information processing apparatus provided in Embodiment 4 of the present disclosure, including: a predicted probability acquisition module 410, a target character acquisition module 420, and a first valid information acquisition module 430.
  • The predicted probability acquisition module 410 is configured to acquire the text information to be recognized, and to obtain, for each of the multiple characters in the text information, a predicted start probability as the extraction starting point and a predicted end probability as the extraction end point;
  • the target character acquisition module 420 is configured to determine the target start character according to the predicted start probability of the plurality of characters, and to determine the target end character according to the predicted end probability of the plurality of characters;
  • the first valid information obtaining module 430 is configured to extract valid information in the text information according to the target start character and the target end character.
  • In the technical solution of the embodiments of the present disclosure, the predicted start probability of each character in the text information to be recognized being the extraction start point, and the predicted end probability of each character being the extraction end point, are acquired; after the target start character and target end character are determined, the valid information in the text information to be recognized is extracted. This achieves accurate extraction of the valid information in the text, avoids deleting the entire text when the text information to be recognized contains invalid information, gives text-extraction capability even for text information that does not match a preset template, expands the application scope of text extraction technology, and improves the extraction accuracy of valid information.
  • the predicted probability acquisition module 410 is configured to acquire, through a trained sequence extraction model, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point.
  • the information processing device further includes:
  • the classification probability obtaining module is configured to obtain the classification probability that the text information is a text containing invalid information.
  • the predicted probability acquisition module 410 is configured to acquire, if the classification probability is greater than or equal to a preset classification probability threshold, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point.
  • the classification probability acquisition module is configured to acquire, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
  • the information processing device further includes:
  • the second valid information acquisition module is configured to use the text information as valid information if the classification probability is less than a preset classification probability threshold.
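As a hypothetical sketch (not code from the disclosure), the two-stage gate described by these modules might look like the following, where `classify` and `extract_span` are stand-in callables for the semantic classification model and the sequence extraction model:

```python
def process(text, classify, extract_span, threshold=0.5):
    # First stage: a semantic classification model scores the probability
    # that the text contains invalid information.
    p_invalid = classify(text)
    if p_invalid < threshold:
        # Below the preset threshold: the whole text is kept as valid info.
        return text
    # Otherwise the sequence extraction model extracts the valid span.
    return extract_span(text)
```

The threshold value 0.5 is an arbitrary placeholder; the disclosure only speaks of a "preset classification probability threshold".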
  • the information processing device further includes:
  • a semantic classification model acquisition module, configured to acquire a semantic sample set, and to perform semantic classification training on an initial semantic classification model through the semantic sample set to obtain a trained semantic classification model, where the initial semantic classification model is constructed based on a neural network.
  • the information processing device further includes:
  • the sequence extraction model acquisition module is configured to acquire a sequence sample set, and perform sequence extraction training on the initial sequence extraction model through the sequence sample set, so as to obtain a trained sequence extraction model.
  • the target character acquisition module 420 includes:
  • a first predicted start probability acquisition unit, configured to acquire the first predicted start probability with the highest value among the multiple predicted start probabilities, and to judge whether the first character corresponding to the first predicted start probability is text;
  • a first target start character acquisition unit, configured to take the first character as the target start character if the first character is judged to be text; or, if the first character is judged not to be text, to acquire the second predicted start probability with the highest value among the remaining predicted start probabilities and judge whether the second character corresponding to the second predicted start probability is text, and so on, until the target character corresponding to the target predicted start probability with the highest value among the remaining predicted start probabilities is text, and to take that target character as the target start character.
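The iterative start-character selection described above can be sketched as follows; this is a hypothetical helper, not code from the disclosure, and `is_text` is an assumed predicate that distinguishes text characters from punctuation and other symbols:

```python
def pick_target_start(chars, start_probs, is_text):
    # Visit candidate positions in order of decreasing predicted start
    # probability and return the first position holding an actual text
    # character (skipping punctuation and other symbols).
    order = sorted(range(len(chars)), key=lambda i: start_probs[i], reverse=True)
    for i in order:
        if is_text(chars[i]):
            return i
    return None  # no text character at all
```

For example, with characters `"!ab"` and probabilities `[0.6, 0.3, 0.1]`, position 0 has the highest probability but holds `"!"`, so position 1 is returned.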
  • the target character acquisition module 420 further includes:
  • the second target start character acquisition unit is configured to take the text character that is located after the first character and closest to the first character as the target start character.
  • the target character acquisition module 420 includes:
  • a target start character acquisition module, configured to, if the first predicted start probability with the highest value among the multiple predicted start probabilities corresponds to multiple candidate start characters, select, among the multiple candidate start characters, the target candidate start character whose character order is front-most as the target start character;
  • a target end character acquisition module, configured to, if the first predicted end probability with the highest value among the multiple predicted end probabilities corresponds to multiple candidate end characters, select, among the multiple candidate end characters, the target candidate end character whose character order is rear-most as the target end character.
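A minimal sketch of this tie-breaking rule, assuming exact probability comparison (function and parameter names are hypothetical):

```python
def break_ties(probs, position="start"):
    # When several characters share the maximal probability, take the
    # front-most one for a start character and the rear-most one for an
    # end character, so the extracted span is as wide as possible.
    top = max(probs)
    tied = [i for i, p in enumerate(probs) if p == top]
    return tied[0] if position == "start" else tied[-1]
```

With probabilities `[0.1, 0.4, 0.4]`, the start character is position 1 and the end character is position 2.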
  • the target character acquisition module 420 further includes:
  • a character extraction unit, configured to acquire the first predicted start probability with the highest value and the second predicted start probability with the second-highest value among the multiple predicted start probabilities, and the first predicted end probability with the highest value and the second predicted end probability with the second-highest value among the multiple predicted end probabilities, and to acquire the first character, the second character, the third character, and the fourth character corresponding, respectively, to the first predicted start probability, the second predicted start probability, the first predicted end probability, and the second predicted end probability;
  • a character sorting execution unit, configured to determine the character order of the first character, the second character, the third character, and the fourth character;
  • a target character extraction unit, configured to, if the characters are ordered as the first character, the third character, the second character, and the fourth character, take the first character and the third character as the first target start character and the first target end character respectively, and take the second character and the fourth character as the second target start character and the second target end character respectively.
  • the first valid information acquisition module 430 is configured to extract the first valid information in the text information according to the first target start character and the first target end character, and to extract the second valid information in the text information according to the second target start character and the second target end character.
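The double-span case can be illustrated with the following hypothetical sketch (not the disclosed implementation); it assumes end positions are inclusive and that the top-ranked start lies before the second-ranked start:

```python
def extract_two_spans(text, start_probs, end_probs):
    # Top-2 start positions (first and second character) and top-2 end
    # positions (third and fourth character), ranked by probability.
    c1, c2 = sorted(range(len(text)), key=lambda i: start_probs[i], reverse=True)[:2]
    c3, c4 = sorted(range(len(text)), key=lambda i: end_probs[i], reverse=True)[:2]
    # If they appear in the order c1 < c3 < c2 < c4, the text carries two
    # valid spans: [c1, c3] and [c2, c4] (end positions inclusive).
    if c1 < c3 < c2 < c4:
        return text[c1:c3 + 1], text[c2:c4 + 1]
    return None  # characters not in the double-span order
```

For an 8-character text whose start probabilities peak at positions 0 and 4 and whose end probabilities peak at positions 2 and 7, the two extracted spans are the first three characters and the last four.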
  • the initial sequence extraction model is constructed based on a self-attention mechanism.
  • The above apparatus can execute the information processing method provided by any embodiment of the present disclosure, and has functional modules and effects corresponding to the executed method.
  • FIG. 5 shows a schematic structural diagram of an electronic device 500 suitable for implementing an embodiment of the present disclosure.
  • the terminal device in the embodiments of the present disclosure may include mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (Portable Android Device, PAD), a portable multimedia player (PMP), and an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and stationary terminals such as a digital television (TV) and a desktop computer.
  • the electronic device shown in FIG. 5 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 500 may include a processing apparatus (such as a central processing unit or a graphics processor) 501, and the processing apparatus 501 may perform various appropriate actions and processes based on a program stored in a read-only memory (ROM) 502 or a program loaded from a storage apparatus 508 into a random access memory (RAM) 503.
  • the RAM 503 also stores various programs and data required for the operation of the electronic device 500.
  • the processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504.
  • An Input/Output (I/O) interface 505 is also connected to the bus 504 .
  • the following apparatuses may be connected to the I/O interface 505: an input apparatus 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; an output apparatus 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 508 including, for example, a magnetic tape and a hard disk; and a communication apparatus 509.
  • Communication means 509 may allow electronic device 500 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 5 shows an electronic device 500 having various apparatuses, it is not required to implement or have all of the illustrated apparatuses; more or fewer apparatuses may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 509, or from the storage device 508, or from the ROM 502.
  • When the computer program is executed by the processing apparatus 501, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the embodiments of the present disclosure provide a storage medium containing computer-executable instructions, and the computer-executable instructions, when executed by a computer processor, implement the information processing methods provided by the foregoing embodiments.
  • the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium can be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • Computer-readable storage media may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, RAM, ROM, erasable programmable read-only memory (EPROM), flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the storage medium may be a non-transitory storage medium.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the program code embodied on the computer readable medium may be transmitted by any suitable medium, including: electric wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the above.
  • clients and servers can communicate using any currently known or future-developed network protocol, such as the HyperText Transfer Protocol (HTTP), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: acquire text information to be recognized, and acquire the predicted start probability of each of multiple characters in the text information being the extraction start point, and the predicted end probability of each of the multiple characters being the extraction end point; determine a target start character according to the predicted start probabilities of the multiple characters, and determine a target end character according to the predicted end probabilities of the multiple characters; and extract valid information in the text information according to the target start character and the target end character.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network, including a LAN or WAN, or may be connected to an external computer (eg, using an Internet service provider to connect through the Internet).
  • each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • in some cases, the name of the module does not constitute a limitation of the module itself.
  • the predicted probability acquisition module can be described as "a module that acquires text information to be recognized and acquires, through a trained sequence extraction model, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point".
  • the functions described herein above may be performed, at least in part, by one or more hardware logic components.
  • exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), etc.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • Machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM, flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • Example 1 provides an information processing method, including:
  • acquiring text information to be recognized, and acquiring the predicted start probability of each of multiple characters in the text information being the extraction start point, and the predicted end probability of each of the multiple characters being the extraction end point;
  • determining a target start character according to the predicted start probabilities of the multiple characters, and determining a target end character according to the predicted end probabilities of the multiple characters;
  • extracting valid information in the text information according to the target start character and the target end character.
  • Example 2 provides the method of Example 1, further comprising:
  • the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point, are acquired through a trained sequence extraction model.
  • Example 3 provides the method of Example 1, further comprising:
  • if the classification probability is greater than or equal to a preset probability threshold, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point, are acquired.
  • Example 4 provides the method of Example 3, further comprising:
  • the classification probability that the text information is text containing invalid information is acquired through a trained semantic classification model.
  • Example 5 provides the method of Example 3, further comprising:
  • if the classification probability is less than the preset probability threshold, the text information is used as valid information.
  • Example 6 provides the method of Example 4, further comprising:
  • acquiring a semantic sample set, and performing semantic classification training on an initial semantic classification model through the semantic sample set to obtain the trained semantic classification model, where the initial semantic classification model is constructed based on a neural network.
  • Example 7 provides the method of Example 2, further comprising:
  • acquiring a sequence sample set, and performing sequence extraction training on an initial sequence extraction model through the sequence sample set to obtain the trained sequence extraction model.
  • Example 8 provides the method described in any one of Examples 1 to 7, further comprising:
  • the first predicted start probability with the highest value among the multiple predicted start probabilities is acquired, and whether the first character corresponding to the first predicted start probability is text is judged;
  • if the first character is judged to be text, the first character is used as the target start character;
  • if the first character is judged not to be text, the second predicted start probability with the highest value among the remaining predicted start probabilities other than the first predicted start probability is acquired, and whether the second character corresponding to the second predicted start probability is text is judged, and so on, until the target character corresponding to the target predicted start probability with the highest value among the remaining predicted start probabilities is text, and that target character is used as the target start character.
  • Example 9 provides the method of Example 8, further comprising:
  • the text character that is located after the first character and closest to the first character is used as the target start character.
  • Example 10 provides the method described in any one of Examples 1 to 7, further comprising:
  • if the first predicted start probability with the highest value among the multiple predicted start probabilities corresponds to multiple candidate start characters, the target candidate start character whose character order is front-most among the multiple candidate start characters is selected as the target start character;
  • if the first predicted end probability with the highest value among the multiple predicted end probabilities corresponds to multiple candidate end characters, the target candidate end character whose character order is rear-most is selected as the target end character.
  • Example 11 provides the method described in any one of Examples 1 to 7, further comprising:
  • if the characters are ordered as the first character, the third character, the second character, and the fourth character, the first character and the third character are used as the first target start character and the first target end character respectively, and the second character and the fourth character are used as the second target start character and the second target end character respectively;
  • the first valid information in the text information is extracted according to the first target start character and the first target end character;
  • the second valid information in the text information is extracted according to the second target start character and the second target end character.
  • Example 12 provides the method described in Example 7, further comprising:
  • the initial sequence extraction model is constructed based on a self-attention mechanism.
  • Example 13 provides an information processing apparatus, including:
  • a predicted probability acquisition module, configured to acquire text information to be recognized, and to acquire the predicted start probability of each of multiple characters in the text information being the extraction start point, and the predicted end probability of each of the multiple characters being the extraction end point;
  • a target character acquisition module, configured to determine a target start character according to the predicted start probabilities of the multiple characters, and to determine a target end character according to the predicted end probabilities of the multiple characters;
  • the first valid information acquisition module is configured to extract valid information in the text information according to the target start character and the target end character.
  • Example 14 provides the apparatus of Example 13, further comprising:
  • the predicted probability acquisition module is configured to acquire, through a trained sequence extraction model, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point.
  • Example 15 provides the apparatus of Example 13, further comprising:
  • the classification probability obtaining module is configured to obtain the classification probability that the text information is a text containing invalid information.
  • the predicted probability acquisition module is configured to acquire, if the classification probability is greater than or equal to a preset probability threshold, through the trained sequence extraction model, the predicted start probability of each character in the text information being the extraction start point, and the predicted end probability of each character being the extraction end point.
  • Example 16 provides the apparatus of Example 15, further comprising:
  • the classification probability acquisition module is configured to acquire, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
  • Example 17 provides the apparatus of Example 15, further comprising:
  • the second valid information acquisition module is configured to use the text information as valid information if the classification probability is less than a preset probability threshold.
  • Example 18 provides the apparatus of Example 16, further comprising:
  • a semantic classification model acquisition module, configured to acquire a semantic sample set, and to perform semantic classification training on an initial semantic classification model through the semantic sample set to obtain a trained semantic classification model, where the initial semantic classification model is constructed based on a neural network.
  • Example 19 provides the apparatus of Example 14, further comprising:
  • the sequence extraction model acquisition module is configured to acquire a sequence sample set, and perform sequence extraction training on the initial sequence extraction model through the sequence sample set, so as to obtain a trained sequence extraction model.
  • Example 20 provides the apparatus described in any one of Examples 13 to 19, and the target character acquisition module includes:
  • a first predicted start probability acquisition unit, configured to acquire the first predicted start probability with the highest value among the multiple predicted start probabilities, and to judge whether the first character corresponding to the first predicted start probability is text;
  • a first target start character acquisition unit, configured to take the first character as the target start character if the first character is judged to be text; or, if the first character is judged not to be text, to acquire the second predicted start probability with the highest value among the remaining predicted start probabilities and judge whether the second character corresponding to the second predicted start probability is text, and so on, until the target character corresponding to the target predicted start probability with the highest value among the remaining predicted start probabilities is text, and to take that target character as the target start character.
  • Example 21 provides the apparatus of Example 20, the target character acquisition module, further comprising:
  • the second target starting character obtaining unit is set to take the text that is located after the first character and is closest to the first character as the target starting character.
  • Example 22 provides the apparatus described in any one of Examples 13 to 19, and the target character acquisition module includes:
  • a target start character acquisition module, configured to, if the first predicted start probability with the highest value among the multiple predicted start probabilities corresponds to multiple candidate start characters, select, among the multiple candidate start characters, the target candidate start character whose character order is front-most as the target start character;
  • a target end character acquisition module, configured to, if the first predicted end probability with the highest value among the multiple predicted end probabilities corresponds to multiple candidate end characters, select, among the multiple candidate end characters, the target candidate end character whose character order is rear-most as the target end character.
  • Example 23 provides the apparatus described in any one of Examples 13 to 19, a target character acquisition module, further comprising:
  • a character extraction unit, configured to acquire the first predicted start probability with the highest value and the second predicted start probability with the second-highest value among the multiple predicted start probabilities, and the first predicted end probability with the highest value and the second predicted end probability with the second-highest value among the multiple predicted end probabilities, and to acquire the first character, the second character, the third character, and the fourth character corresponding, respectively, to the first predicted start probability, the second predicted start probability, the first predicted end probability, and the second predicted end probability;
  • a character sorting execution unit, configured to determine the character order of the first character, the second character, the third character, and the fourth character;
  • a target character extraction unit, configured to, if the characters are ordered as the first character, the third character, the second character, and the fourth character, take the first character and the third character as the first target start character and the first target end character respectively, and take the second character and the fourth character as the second target start character and the second target end character respectively.
  • the first valid information acquisition module is configured to extract the first valid information in the text information according to the first target start character and the first target end character, and to extract the second valid information in the text information according to the second target start character and the second target end character.
  • Example 24 provides the apparatus of Example 19, further comprising:
  • the initial sequence extraction model is constructed based on a self-attention mechanism.
  • Example 25 provides an electronic device, including a memory, a processing apparatus, and a computer program stored in the memory and executable on the processing apparatus, where the processing apparatus implements the information processing method of any one of Examples 1-12 when executing the program.
  • Example 26 provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the information processing method of any one of Examples 1-12.


Abstract

Embodiments of the present disclosure disclose an information processing method, apparatus, electronic device, and storage medium. The information processing method includes: acquiring text information to be recognized, and acquiring a predicted start probability of each of multiple characters in the text information being the extraction start point, and a predicted end probability of each of the multiple characters being the extraction end point (S110); determining a target start character according to the predicted start probabilities of the multiple characters, and determining a target end character according to the predicted end probabilities of the multiple characters (S120); and extracting valid information in the text information according to the target start character and the target end character (S130).

Description

Information Processing Method, Apparatus, Electronic Device, and Storage Medium
This application claims priority to Chinese Patent Application No. 202011330581.7, filed with the Chinese Patent Office on November 24, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to text processing technology, for example, to an information processing method, apparatus, electronic device, and storage medium.
Background
With the continuous development of Internet technology, a wide variety of text information appears on the network, providing people with a large amount of information; extracting valid information from this mass of material has become an important branch of the text processing field.
After the text information to be recognized is acquired, whether it is valid information is judged according to the similarity between texts: if the similarity is high, the text information is judged to be valid and retained; if the similarity is low, the text information to be recognized is judged to be invalid and deleted in its entirety.
Summary
The present disclosure provides an information processing method, apparatus, electronic device, and storage medium, so as to extract valid information from text information to be recognized.
An information processing method is provided, including:
acquiring text information to be recognized, and acquiring a predicted start probability of each of multiple characters in the text information being the extraction start point, and a predicted end probability of each of the multiple characters being the extraction end point;
determining a target start character according to the predicted start probabilities of the multiple characters, and determining a target end character according to the predicted end probabilities of the multiple characters;
extracting valid information in the text information according to the target start character and the target end character.
An information processing apparatus is also provided, including:
a predicted probability acquisition module, configured to acquire text information to be recognized, and to acquire a predicted start probability of each of multiple characters in the text information being the extraction start point, and a predicted end probability of each of the multiple characters being the extraction end point;
a target character acquisition module, configured to determine a target start character according to the predicted start probabilities of the multiple characters, and to determine a target end character according to the predicted end probabilities of the multiple characters;
a first valid information acquisition module, configured to extract valid information in the text information according to the target start character and the target end character.
An electronic device is also provided, including a memory, a processing apparatus, and a computer program stored in the memory and executable on the processing apparatus, where the processing apparatus implements the information processing method of any embodiment of the present disclosure when executing the computer program.
A storage medium containing computer-executable instructions is also provided, where the computer-executable instructions, when executed by a computer processor, are used to perform the information processing method of any embodiment of the present disclosure.
Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of an information processing method of the present disclosure;
FIG. 2 is a flowchart of another embodiment of an information processing method of the present disclosure;
FIG. 3 is a flowchart of another embodiment of an information processing method of the present disclosure;
FIG. 4 is a structural block diagram of an embodiment of an information processing apparatus of the present disclosure;
FIG. 5 is a structural block diagram of an electronic device suitable for implementing an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described below with reference to the drawings. Although some embodiments of the present disclosure are shown in the drawings, the present disclosure may be implemented in various forms and should not be construed as being limited to the embodiments set forth herein. The drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
The steps recited in the method implementations of the present disclosure may be performed in different orders and/or in parallel. In addition, the method implementations may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
Concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not intended to limit the order of the functions performed by these apparatuses, modules, or units or their interdependence.
The modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive, and should be understood as "one or more" unless the context clearly indicates otherwise.
The names of the messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Embodiment One
FIG. 1 is a flowchart of an information processing method provided by Embodiment One of the present disclosure. This embodiment is applicable to extracting valid information from text information. The method may be performed by the information processing apparatus in the embodiments of the present disclosure, which may be implemented in software and/or hardware and integrated in a terminal device or a server. The method includes the following steps.
S110: Acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point.
The text information to be recognized may come from many sources: it may be user question-and-answer results obtained through a questionnaire, remarks posted by network users about an event, or text published in e-books, journals, magazines, and the like; in the embodiments of the present disclosure, optionally, the source of the text information to be recognized is not limited. Because the sources are so varied, the acquired text information may contain useless content. For example, when remarks posted by a network user about an event are acquired, the reply usually contains polite greetings, owing to a website's fixed presentation format or the user's personal speaking habits. It is therefore necessary to extract the required valid content from the text information to be recognized. For example, suppose the text to be recognized is "Hello, this phenomenon occurs because the smoke ring initially gathers around the opening and forms a vortex, and the rotational motion keeps it from rubbing against the air, thereby keeping the smoke ring stable! I hope my answer helps you!". This text contains polite greetings, and the valid information should be "This phenomenon occurs because the smoke ring initially gathers around the opening and forms a vortex, and the rotational motion keeps it from rubbing against the air, thereby keeping the smoke ring stable!".
After the text information to be recognized is acquired, the starting probability of each character in the text serving as an extraction starting point, and the ending probability of each character serving as an extraction end point, may be predicted according to the starting characters and ending characters of multiple pieces of valid information stored in a database. For example, the starting characters of all valid information in the database are counted, the occurrence probability of each starting character is calculated, and that probability is assigned to the identical character in the text to be recognized, with all other characters in the text assigned a probability of zero; in this way, the probability of each character in the text serving as a starting character is obtained. Likewise, the ending characters of all valid information in the database are counted, the occurrence probability of each ending character is calculated, and that probability is assigned to the identical character in the text to be recognized, with all other characters assigned a probability of zero; in this way, the probability of each character in the text serving as an ending character is obtained.
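The database-driven estimate described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the tiny `valid_db` corpus and the function name are invented for the example.

```python
from collections import Counter

def char_probabilities(text, valid_db):
    """Assign each character of `text` the relative frequency with which
    that character starts / ends a valid string in `valid_db`; characters
    never seen in that position get probability 0."""
    start_freq = Counter(s[0] for s in valid_db if s)
    end_freq = Counter(s[-1] for s in valid_db if s)
    n = len(valid_db)
    start_p = [start_freq[c] / n for c in text]
    end_p = [end_freq[c] / n for c in text]
    return start_p, end_p

valid_db = ["wind is moving air", "water is denser than ice", "wind chills"]
start_p, end_p = char_probabilities("wind or water", valid_db)
# 'w' starts every database entry, so start_p[0] is 1.0;
# 'w' never ends one, so end_p[0] is 0.0.
```

A learned model (discussed next) replaces this frequency table in practice, but the output shape is the same: one start probability and one end probability per character.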
Optionally, in the embodiments of the present disclosure, acquiring, for each of the plurality of characters in the text information, the predicted starting probability of the character serving as an extraction starting point and the predicted ending probability of the character serving as an extraction end point includes: acquiring, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point. The sequence extraction model is a pre-trained ranking model whose role is to extract text features from the input text information and obtain feature vectors. A text feature is the basic unit representing text content; the characters, words, or phrases in the text information may serve as its text features, and a feature vector is the quantified representation of a text feature, usually multi-dimensional. After the feature vector of the text to be recognized is obtained, the model recognizes the feature vector and outputs the predicted starting probability of each character serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point. For example, suppose the text to be recognized contains 50 characters (including both words and symbols), numbered 1 to 50 in character order. The predicted probability that the valid content is the text between characters 5 and 30 is 40%, that it is the text between characters 8 and 30 is 30%, and that it is the text between characters 5 and 20 is 30%. It can thus be determined that the predicted probability of character 5 serving as the extraction starting point is 40% + 30% = 70%, that of character 8 is 30%, and that of all other characters is 0; the predicted probability of character 20 serving as the extraction end point is 30%, that of character 30 is 40% + 30% = 70%, and that of all other characters is 0.
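The per-character probabilities in the 50-character example follow mechanically from the span-level predictions by summing each span's probability onto its endpoints. A minimal sketch (the span probabilities are taken from the example above; the function name is ours):

```python
def per_char_probs(span_probs, num_chars):
    """span_probs: {(start_no, end_no): probability} over 1-based character
    numbers. Returns per-character start / end probability lists, where
    index 0 corresponds to character number 1."""
    start = [0.0] * num_chars
    end = [0.0] * num_chars
    for (s, e), p in span_probs.items():
        start[s - 1] += p  # this span votes for character s as the start
        end[e - 1] += p    # and for character e as the end
    return start, end

# Spans from the example: 5-30 with 40%, 8-30 with 30%, 5-20 with 30%.
start, end = per_char_probs({(5, 30): 0.4, (8, 30): 0.3, (5, 20): 0.3}, 50)
# start[4] (character 5) is 0.4 + 0.3 = 0.7; end[29] (character 30) is 0.7.
```

This reproduces the figures in the paragraph: 70% start at character 5, 30% at character 8, 30% end at character 20, 70% at character 30.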
Optionally, in the embodiments of the present disclosure, before acquiring, through the trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point, the method further includes: acquiring a sequence sample set, and performing sequence extraction training on an initial sequence extraction model using the sequence sample set, to obtain the trained sequence extraction model. The sequence sample set includes multiple sequence samples, each of which is a mapping pair consisting of original text information and the corresponding valid text information. For example, in one sequence sample the original text information is "Hello, let me answer you! Wind is a natural phenomenon caused by the flow of air, which is driven by the heat of solar radiation. If you are satisfied with my answer, please give me a like.", and the corresponding valid text information is "Wind is a natural phenomenon caused by the flow of air, which is driven by the heat of solar radiation.". The original text information of each sequence sample is used as the input information and the valid text information as the output information to perform semantic understanding training on the initial sequence extraction model, finally obtaining the trained sequence extraction model.
Optionally, in the embodiments of the present disclosure, the initial sequence extraction model is built on a self-attention mechanism. The self-attention mechanism is an improvement on the attention mechanism: it can quickly extract the important features of sparse data while reducing the dependence on external information, making it better at capturing the internal correlations of data or features. In this embodiment, the initial sequence extraction model may include a Bidirectional Encoder Representations from Transformers (BERT) model or a RoBERTa model. The BERT model is a pre-trained language model; after training on a large amount of unlabeled corpus, it can produce a semantic representation of the text to be recognized that is rich in semantic information. Because the BERT model itself already has strong language understanding capability, it only needs fine-tuning; using BERT as the initial sequence extraction model therefore reduces the model's dependence on the number of sequence samples in the sequence sample set and lowers the difficulty of training. The RoBERTa model is another semantic representation model obtained on the basis of the BERT model by improving the training tasks and the data generation method.
S120: Determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters.
After the predicted starting probabilities of the plurality of characters are obtained, the character with the largest probability value is taken as the target starting character; that is, the character with the largest starting probability is the most likely starting point of the extracted valid information. Likewise, after the predicted ending probabilities are obtained, the character with the largest probability value is taken as the target ending character, i.e., the most likely end point of the extracted valid information. For example, following the above scheme, character 5 and character 30 are taken as the target starting character and the target ending character, respectively.
Optionally, in the embodiments of the present disclosure, determining the target starting character according to the predicted starting probabilities of the plurality of characters includes: acquiring a first predicted starting probability with the highest value among the plurality of predicted starting probabilities, and judging whether a first character corresponding to the first predicted starting probability is a word character; if the first character is judged to be a word character, taking the first character as the target starting character; or, if the first character is judged not to be a word character, acquiring a second predicted starting probability with the highest value among the remaining predicted starting probabilities excluding the first predicted starting probability, and judging whether a second character corresponding to the second predicted starting probability is a word character, and so on, until the target character corresponding to the highest remaining predicted starting probability is a word character, and taking that target character as the target starting character. A complete piece of valid information necessarily begins with a word character. Therefore, if the character with the highest predicted starting probability is a word character, it is taken as the extraction starting point; if it is not a word character (for example, a punctuation mark), the character with the highest probability among the remaining predicted starting probabilities (i.e., the second-highest overall) is examined in turn, and so on, until the character with the highest remaining probability is a word character, which is taken as the extraction starting point. This avoids taking characters with no substantive meaning, such as punctuation marks, as the extraction starting point and improves the extraction precision of valid information.
Determining the target ending character according to the predicted ending probabilities of the plurality of characters includes: acquiring a first predicted ending probability with the highest value among the plurality of predicted ending probabilities, and judging whether a third character corresponding to the first predicted ending probability is a word character; when the third character is judged to be a word character, taking the third character as the target ending character; or, when the third character is judged not to be a word character, acquiring a second predicted ending probability with the highest value among the remaining predicted ending probabilities excluding the first predicted ending probability, and judging whether a fourth character corresponding to the second predicted ending probability is a word character, and so on, until the target character corresponding to the highest remaining predicted ending probability is a word character, and taking that target character as the target ending character.
After the first character is judged not to be a word character, the method may further include: taking the word character that is located after the first character and closest to the first character as the target starting character. If the character with the highest predicted starting probability is not a word character (for example, a punctuation mark), the sequence extraction model may have made a sentence segmentation error. In that case, according to the proximity principle, the word character whose character order follows the first character and that is closest to the first character is taken as the target starting character, so that the corresponding starting word is sought near the position with the highest predicted starting probability, avoiding extraction errors caused by the model's mis-segmentation.
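The two selection strategies just described, walking down the probability ranking until a word character is found, and falling back to the nearest word character after the top-ranked position, can be sketched as below. This is a hedged illustration: treating alphanumeric characters as "word characters" via `isalnum` is our assumption, not the patent's definition.

```python
def pick_start_by_ranking(text, start_probs):
    """Try characters in descending start-probability order until one is
    a word character (letter or digit)."""
    order = sorted(range(len(text)), key=lambda i: start_probs[i], reverse=True)
    for i in order:
        if text[i].isalnum():
            return i
    return None  # no word character anywhere

def pick_start_by_proximity(text, start_probs):
    """Take the argmax; if it is not a word character, move forward to the
    nearest following word character (the mis-segmentation fallback)."""
    i = max(range(len(text)), key=lambda i: start_probs[i])
    while i < len(text) and not text[i].isalnum():
        i += 1
    return i if i < len(text) else None

text = "! hello"
probs = [0.6, 0.0, 0.3, 0.1, 0.0, 0.0, 0.0]  # argmax is the '!' at index 0
r = pick_start_by_ranking(text, probs)
p = pick_start_by_proximity(text, probs)
# both land on 'h' at index 2 here
```

The two strategies can disagree in general: ranking may jump to a high-probability character far away, while proximity always stays near the top-ranked position, which matches the mis-segmentation rationale above.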
Optionally, in the embodiments of the present disclosure, determining the target starting character according to the predicted starting probabilities of the plurality of characters, and determining the target ending character according to the predicted ending probabilities of the plurality of characters, includes: if the first predicted starting probability with the highest value among the plurality of predicted starting probabilities corresponds to multiple candidate starting characters, selecting, among the multiple candidate starting characters, the target candidate starting character that comes first in character order as the target starting character; and/or, if the first predicted ending probability with the highest value among the plurality of predicted ending probabilities corresponds to multiple candidate ending characters, selecting, among the multiple candidate ending characters, the target candidate ending character that comes last in character order as the target ending character. If multiple characters simultaneously have the highest predicted starting probability, the multiple candidate starting characters are arranged in character order and the character earliest in the order is taken as the target starting character, so as to widen the extraction range as much as possible and avoid the loss of valid information. Likewise, if multiple characters simultaneously have the highest predicted ending probability, the multiple candidate ending characters are arranged in character order and the character latest in the order is taken as the target ending character, again widening the extraction range as much as possible and avoiding the loss of valid information. After the valid information is extracted, a second extraction may also be performed through the sequence extraction model or another method; that is, extraction is performed again on the extracted valid information, to obtain accurate valid information.
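The tie-breaking rule described above, earliest candidate for the start and latest candidate for the end so that the extracted span is as wide as possible, amounts to a pair of min/max selections over the tied indices (a sketch; the function name is ours):

```python
def break_ties(start_probs, end_probs):
    """Among characters sharing the maximal start probability keep the
    first in character order; among those sharing the maximal end
    probability keep the last."""
    max_s = max(start_probs)
    max_e = max(end_probs)
    start_idx = min(i for i, p in enumerate(start_probs) if p == max_s)
    end_idx = max(i for i, p in enumerate(end_probs) if p == max_e)
    return start_idx, end_idx

# Two positions tie for the best start and three for the best end:
s, e = break_ties([0.1, 0.4, 0.4, 0.1], [0.3, 0.3, 0.1, 0.3])
# s == 1 (earliest tied start), e == 3 (latest tied end)
```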
S130: Extract valid information from the text information according to the target starting character and the target ending character.
The technical scheme that judges whether text information is valid according to inter-text similarity can only filter out text of specific content according to pre-acquired template information, and cannot recognize text information from unknown domains or text without a matching template. Moreover, once the similarity between the text information and the template information is determined to be low, the text information can only be deleted; the valid information within it cannot be extracted, so the extraction precision of valid information is low. In the technical scheme of the present disclosure, by contrast, once the target starting character and the target ending character are determined, the valid information in the text information is accurately extracted, and the scheme likewise has extraction capability for text information from unknown domains or without a matching template.
In the technical scheme of the embodiments of the present disclosure, the predicted starting probability of each character in the text to be recognized serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point are acquired, and after the target starting character and the target ending character are determined, the valid information in the text to be recognized is extracted. This achieves accurate extraction of valid information from text, avoids deleting the entire text when the text to be recognized contains invalid information, provides extraction capability for text from unknown domains or without a matching template, broadens the application range of text extraction technology, and improves the extraction precision of valid information.
Embodiment Two
FIG. 2 is a flowchart of an information processing method provided by Embodiment Two of the present disclosure. This embodiment is described on the basis of the above embodiments. In this embodiment, after the text information to be recognized is acquired, the text information is first classified. The method of this embodiment includes the following steps.
S210: Acquire text information to be recognized, and acquire the classification probability that the text information is text containing invalid information.
After the text information to be recognized is acquired, it may be compared for similarity with multiple pieces of valid information stored in a database, and the highest similarity percentage among them is obtained. Since none of the valid information stored in the database contains invalid information text, the higher the similarity percentage between the text to be recognized and the stored valid information, the lower the classification probability that the text contains invalid information. Subtracting the similarity percentage from the value 1 gives the classification probability that the text information is text containing invalid information.
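The "1 minus best similarity" rule can be sketched with the standard library. `SequenceMatcher.ratio` is only a stand-in for whatever similarity measure the system actually uses; the function name and corpus are our assumptions.

```python
from difflib import SequenceMatcher

def classification_probability(text, valid_db):
    """1 minus the best similarity against the stored valid texts;
    a higher value means the text is more likely to contain invalid
    content."""
    best = max(SequenceMatcher(None, text, v).ratio() for v in valid_db)
    return 1.0 - best

# A text identical to a stored valid answer scores 0 (surely valid);
# a text sharing nothing with the database scores 1 (likely invalid).
p_known = classification_probability("abc", ["abc", "xyz"])
p_alien = classification_probability("abc", ["xyz"])
```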
Optionally, in this embodiment, acquiring the classification probability that the text information is text containing invalid information includes: acquiring, through a trained semantic classification model, the classification probability that the text information is text containing invalid information. The semantic classification model is a pre-trained classification model whose role is, for the input text to be recognized, to extract text features and output the classification category of the text information and the corresponding classification probability. The classification categories include text containing invalid information and text not containing invalid information; the classification probability reflects the likelihood that the text information contains invalid information: the larger the probability value, the more likely the text content is text containing invalid information.
Optionally, in the embodiments of the present disclosure, before acquiring, through the trained semantic classification model, the classification probability that the text information is text containing invalid information, the method further includes: acquiring a semantic sample set, and performing semantic classification training on an initial semantic classification model using the semantic sample set, to obtain the trained semantic classification model; where the initial semantic classification model is built on a neural network. The semantic sample set includes multiple positive semantic samples and multiple negative semantic samples. A positive semantic sample is an invalid-information sample, i.e., all text content in a positive semantic sample is invalid information; for example, "Hello, I hope my answer helps you" and "Sorry, no answer was found." are two positive semantic samples, whose label information is 1, meaning that when the initial semantic classification model is trained on them the output result is calibrated to 1. A negative semantic sample is a valid-information sample, i.e., all text content in a negative semantic sample is valid information; for example, "The density of water is greater than the density of ice" and "An earthquake is a natural phenomenon in which vibration is caused during the rapid release of energy from the Earth's crust, generating seismic waves" are two negative semantic samples, whose label information is 0, meaning that during training the output result is calibrated to 0. Training the initial semantic classification model with positive and negative semantic samples gives the trained model the ability, given input text information, to output its classification category and the corresponding classification probability, where the classification probability is a value greater than or equal to 0 and less than or equal to 1: the larger the value, the closer the text is to a positive semantic sample and the greater the probability that it contains invalid information; the smaller the value, the closer it is to a negative semantic sample and the smaller the probability that it contains invalid information.
S220: If the classification probability is greater than or equal to a preset classification probability threshold, acquire the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
The preset classification probability threshold may be set as needed. Since the above scheme outputs only two classification categories (text containing invalid information and text not containing invalid information), the preset classification probability threshold may be set to 0.5 based on binary classification: when the classification category of the text is text containing invalid information (i.e., the classification probability is greater than or equal to 0.5), the text is fed into the sequence extraction model for extraction; when the category is text not containing invalid information (i.e., the classification probability is less than 0.5), the text itself is taken as valid information. However, even when the classification probability is below 0.5 and the text has been classified as not containing invalid information, if the classification probability is close to 0.5 (for example, 0.4), the text may still contain invalid information (there is a 40% chance that invalid information remains in the text). To ensure extraction precision, the preset classification probability threshold may be set to a small value, for example 0.05: only when the classification probability of the text is below 0.05 is the text classified as not containing invalid information, and whenever the classification probability is greater than or equal to 0.05 the text must be passed through the trained sequence extraction model for extraction. This ensures classification accuracy for text not containing invalid information and prevents invalid information from remaining in text so classified.
If the classification probability is less than the preset classification probability threshold, the text information is judged to be text not containing invalid information, i.e., all of its text content is valid information; in this case the text information is taken as valid information, that is, all of the text content of the text information is taken as valid information.
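The gating logic of S220, extract only when the classification probability clears the threshold and otherwise keep the whole text, reduces to a small branch. The 0.05 value mirrors the conservative threshold discussed above; the `extract` callable stands in for the sequence extraction model and is an assumption of this sketch.

```python
PRESET_THRESHOLD = 0.05  # conservative value discussed above

def process(text, classification_prob, extract):
    """If the text is likely to contain invalid content, run span
    extraction; otherwise the whole text is already valid information."""
    if classification_prob >= PRESET_THRESHOLD:
        return extract(text)
    return text

# A toy extractor that keeps the middle segment of a '|'-delimited string,
# standing in for the model's start/end span extraction:
toy_extract = lambda t: t.split("|")[1]
kept = process("greeting|answer|sign-off", 0.9, toy_extract)   # "answer"
whole = process("pure answer", 0.01, toy_extract)              # unchanged
```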
S230: Determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters.
S240: Extract valid information from the text information according to the target starting character and the target ending character.
In the technical scheme of the embodiments of the present disclosure, after the text to be recognized is acquired, the classification probability that the text contains irrelevant content is acquired, and valid information is extracted only when this classification probability is greater than or equal to the preset classification probability threshold. This achieves effective judgment of whether text information contains irrelevant content, so that valid information extraction is performed only on text containing irrelevant content, improving the extraction efficiency of valid information in text.
Embodiment Three
FIG. 3 is a flowchart of an information processing method provided by Embodiment Three of the present disclosure. This embodiment is described on the basis of the above embodiments. In this embodiment, the text information to be recognized includes multiple valid information passages. The method of this embodiment includes the following steps.
S310: Acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point.
S320: Acquire a first predicted starting probability with the highest value and a second predicted starting probability with the second-highest value among the plurality of predicted starting probabilities, and a first predicted ending probability with the highest value and a second predicted ending probability with the second-highest value among the plurality of predicted ending probabilities, and acquire a first character, a second character, a third character, and a fourth character corresponding respectively to the first predicted starting probability, the second predicted starting probability, the first predicted ending probability, and the second predicted ending probability.
The text information to be recognized may include multiple passages of valid information. For example, the text to be recognized is "你好!酸雨是指pH小于5.6的雨雪或其他形式的降水,主要是人为的向大气中排放大量酸性物质所造成的;而地震,是地壳快速释放能量过程中造成的振动,期间会产生地震波的一种自然现象。如果您对我的回答满意,请给我点赞。" (roughly: "Hello! Acid rain refers to rain, snow, or other precipitation with a pH below 5.6, mainly caused by large human emissions of acidic substances into the atmosphere; and an earthquake is a natural phenomenon in which vibration is caused by the rapid release of energy from the Earth's crust, generating seismic waves. If you are satisfied with my answer, please give me a like."). This text contains two valid information passages, one explaining acid rain and one explaining earthquakes, i.e., the two natural phenomena. Therefore, among the plurality of predicted starting probabilities, a preset number of predicted starting probabilities with the highest values are acquired, and among the plurality of predicted ending probabilities, a preset number of predicted ending probabilities with the highest values are acquired, so as to locate the multiple valid information passages and obtain the characters corresponding to the multiple probability values.
S330: Determine the character order of the first character, the second character, the third character, and the fourth character.
The character order is the order in which the multiple characters are arranged within the text to be recognized; for example, in the above scheme, the text to be recognized contains 50 characters (including both words and symbols), numbered 1 to 50 in character order.
S340: If the character order is the first character, the third character, the second character, the fourth character, take the first character and the third character as a first target starting character and a first target ending character, respectively, and take the second character and the fourth character as a second target starting character and a second target ending character, respectively.
Taking the above scheme as an example, in the text to be recognized quoted above, suppose the first character is "酸" in the passage "酸雨是指pH小于5.6的雨雪或其他形式的降水", the third character is "的" in "排放大量酸性物质所造成的", the second character is "地" in "而地震", and the fourth character is "象" in "期间会产生地震波的一种自然现象", and the character order of the text matches the arrangement rule above. Accordingly, "酸" and "地" are taken as the first target starting character and the second target starting character, respectively, and "的" and "象" are taken as the first target ending character and the second target ending character, respectively.
S350: Extract first valid information from the text information according to the first target starting character and the first target ending character, and extract second valid information from the text information according to the second target starting character and the second target ending character.
Taking the above scheme as an example, the text between "酸" and "的" is extracted as the first valid information, i.e., "酸雨是指pH小于5.6的雨雪或其他形式的降水,主要是人为的向大气中排放大量酸性物质所造成的" (the acid rain explanation), and the text between "地" and "象" is extracted as the second valid information, i.e., "地震,是地壳快速释放能量过程中造成的振动,期间会产生地震波的一种自然现象" (the earthquake explanation).
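Embodiment Three's two-span extraction, top-two starts, top-two ends, and the ordering check first-start < first-end < second-start < second-end, can be sketched as below (indices and helper names are ours, not the patent's):

```python
def top2(probs):
    """Indices of the highest and second-highest values."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[0], order[1]

def extract_two_spans(text, start_probs, end_probs):
    s1, s2 = top2(start_probs)   # first / second characters
    e1, e2 = top2(end_probs)     # third / fourth characters
    if s1 < e1 < s2 < e2:        # character order: first, third, second, fourth
        return text[s1:e1 + 1], text[s2:e2 + 1]
    return None  # ordering rule not satisfied; fall back to single-span logic

# Toy text with two valid passages "AA" and "BB":
text = "x AA y BB z"
start_probs = [0.0] * 11
start_probs[2], start_probs[7] = 0.5, 0.4   # peaks at the two passage starts
end_probs = [0.0] * 11
end_probs[3], end_probs[8] = 0.5, 0.4       # peaks at the two passage ends
spans = extract_two_spans(text, start_probs, end_probs)
# spans == ("AA", "BB")
```

Extending to more than two passages would mean taking the top-k starts and ends and pairing them by the same ordering constraint.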
In the technical scheme of the embodiments of the present disclosure, after the predicted starting probability of each character serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point are obtained, the multiple predicted starting probabilities and multiple predicted ending probabilities with the highest values are acquired, enabling the separate extraction of multiple valid information passages from the text information, avoiding the loss of valid information and ensuring the completeness of the extracted information.
Embodiment Four
FIG. 4 is a structural block diagram of an information processing apparatus provided by Embodiment Four of the present disclosure, including: a prediction probability acquisition module 410, a target character acquisition module 420, and a first valid information acquisition module 430.
The prediction probability acquisition module 410 is configured to acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point;
the target character acquisition module 420 is configured to determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters;
the first valid information acquisition module 430 is configured to extract valid information from the text information according to the target starting character and the target ending character.
In the technical scheme of the embodiments of the present disclosure, the predicted starting probability of each character in the text to be recognized serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point are acquired, and after the target starting character and the target ending character are determined, the valid information in the text to be recognized is extracted. This achieves accurate extraction of valid information from text, avoids deleting the entire text when the text to be recognized contains invalid information, provides extraction capability for text from unknown domains or without a matching template, broadens the application range of text extraction technology, and improves the extraction precision of valid information.
Optionally, on the basis of the above technical solutions, the prediction probability acquisition module 410 is configured to acquire, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
Optionally, on the basis of the above technical solutions, the information processing apparatus further includes:
a classification probability acquisition module, configured to acquire the classification probability that the text information is text containing invalid information.
Optionally, on the basis of the above technical solutions, the prediction probability acquisition module 410 is configured to acquire, if the classification probability is greater than or equal to a preset classification probability threshold, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
Optionally, on the basis of the above technical solutions, the classification probability acquisition module is configured to acquire, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
Optionally, on the basis of the above technical solutions, the information processing apparatus further includes:
a second valid information acquisition module, configured to take the text information as valid information if the classification probability is less than the preset classification probability threshold.
Optionally, on the basis of the above technical solutions, the information processing apparatus further includes:
a semantic classification model acquisition module, configured to acquire a semantic sample set, and perform semantic classification training on an initial semantic classification model using the semantic sample set, to obtain the trained semantic classification model; where the initial semantic classification model is built on a neural network.
Optionally, on the basis of the above technical solutions, the information processing apparatus further includes:
a sequence extraction model acquisition module, configured to acquire a sequence sample set, and perform sequence extraction training on an initial sequence extraction model using the sequence sample set, to obtain the trained sequence extraction model.
Optionally, on the basis of the above technical solutions, the target character acquisition module 420 includes:
a first predicted starting probability acquisition unit, configured to acquire a first predicted starting probability with the highest value among a plurality of predicted starting probabilities, and judge whether a first character corresponding to the first predicted starting probability is a word character;
a first target starting character acquisition unit, configured to take the first character as the target starting character if the first character is judged to be a word character; or, if the first character is judged not to be a word character, acquire a second predicted starting probability with the highest value among the remaining predicted starting probabilities excluding the first predicted starting probability, and judge whether a second character corresponding to the second predicted starting probability is a word character, and so on, until the target character corresponding to the highest remaining predicted starting probability is a word character, and take the target character as the target starting character.
Optionally, on the basis of the above technical solutions, the target character acquisition module 420 further includes:
a second target starting character acquisition unit, configured to take the word character that is located after the first character and closest to the first character as the target starting character.
Optionally, on the basis of the above technical solutions, the target character acquisition module 420 includes:
a target starting character acquisition module, configured to, if the first predicted starting probability with the highest value among the plurality of predicted starting probabilities corresponds to multiple candidate starting characters, select, among the multiple candidate starting characters, the target candidate starting character that is first in character order as the target starting character;
a target ending character acquisition module, configured to, if the first predicted ending probability with the highest value among the plurality of predicted ending probabilities corresponds to multiple candidate ending characters, select, among the multiple candidate ending characters, the target candidate ending character that is last in character order as the target ending character.
Optionally, on the basis of the above technical solutions, the target character acquisition module 420 further includes:
a character extraction unit, configured to acquire the first predicted starting probability with the highest value and the second predicted starting probability with the second-highest value among the plurality of predicted starting probabilities, and the first predicted ending probability with the highest value and the second predicted ending probability with the second-highest value among the plurality of predicted ending probabilities, and acquire a first character, a second character, a third character, and a fourth character corresponding respectively to the first predicted starting probability, the second predicted starting probability, the first predicted ending probability, and the second predicted ending probability;
a character ordering execution unit, configured to determine the character order of the first character, the second character, the third character, and the fourth character;
a target character extraction unit, configured to, if the character order is the first character, the third character, the second character, the fourth character, take the first character and the third character as a first target starting character and a first target ending character, respectively, and take the second character and the fourth character as a second target starting character and a second target ending character, respectively.
Optionally, on the basis of the above technical solutions, the first valid information acquisition module 430 is configured to extract first valid information from the text information according to the first target starting character and the first target ending character, and extract second valid information from the text information according to the second target starting character and the second target ending character.
Optionally, on the basis of the above technical solutions, the initial sequence extraction model is built on a self-attention mechanism.
The above apparatus can perform the information processing method provided by any embodiment of the present disclosure, and has functional modules and effects corresponding to performing the method. For technical details not described in this embodiment, reference may be made to the method provided by any embodiment of the present disclosure.
Embodiment Five
FIG. 5 shows a schematic structural diagram of an electronic device 500 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (Portable Android Device, PAD), a portable multimedia player (PMP), and a vehicle-mounted terminal (for example, a vehicle navigation terminal), as well as fixed terminals such as a digital television (TV) and a desktop computer. The electronic device shown in FIG. 5 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 500 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 501, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage apparatus 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing apparatus 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
The following apparatuses may be connected to the I/O interface 505: an input apparatus 506 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output apparatus 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 508 including, for example, a magnetic tape and a hard disk; and a communication apparatus 509. The communication apparatus 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 5 shows the electronic device 500 with various apparatuses, it is not required to implement or have all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided.
According to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 509, or installed from the storage apparatus 508, or installed from the ROM 502. When the computer program is executed by the processing apparatus 501, the above functions defined in the methods of the embodiments of the present disclosure are performed.
An embodiment of the present disclosure provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the information processing method provided by the above embodiments.
The computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. The computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. The computer-readable storage medium may include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. The storage medium may be a non-transitory storage medium. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable signal medium may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including: a wire, an optical cable, radio frequency (RF), and the like, or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol, such as the HyperText Transfer Protocol (HTTP), and may be interconnected with digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internet (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be contained in the above electronic device; it may also exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point; determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters; and extract valid information from the text information according to the target starting character and the target ending character.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a LAN or a WAN, or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules involved in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in one case, constitute a limitation on the module itself; for example, the prediction probability acquisition module may be described as "a module for acquiring text information to be recognized, and acquiring, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point". The functions described above herein may be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that may be used include: a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and the like.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. Machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a RAM, a ROM, an EPROM, a flash memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, [Example 1] provides an information processing method, including:
acquiring text information to be recognized, and acquiring, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point;
determining a target starting character according to the predicted starting probabilities of the plurality of characters, and determining a target ending character according to the predicted ending probabilities of the plurality of characters;
extracting valid information from the text information according to the target starting character and the target ending character.
According to one or more embodiments of the present disclosure, [Example 2] provides the method of Example 1, further including:
acquiring, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
According to one or more embodiments of the present disclosure, [Example 3] provides the method of Example 1, further including:
acquiring the classification probability that the text information is text containing invalid information;
if the classification probability is greater than or equal to a preset probability threshold, acquiring the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
According to one or more embodiments of the present disclosure, [Example 4] provides the method of Example 3, further including:
acquiring, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
According to one or more embodiments of the present disclosure, [Example 5] provides the method of Example 3, further including:
if the classification probability is less than the preset classification probability threshold, taking the text information as valid information.
According to one or more embodiments of the present disclosure, [Example 6] provides the method of Example 4, further including:
acquiring a semantic sample set, and performing semantic classification training on an initial semantic classification model using the semantic sample set, to obtain the trained semantic classification model; where the initial semantic classification model is built on a neural network.
According to one or more embodiments of the present disclosure, [Example 7] provides the method of Example 2, further including:
acquiring a sequence sample set, and performing sequence extraction training on an initial sequence extraction model using the sequence sample set, to obtain the trained sequence extraction model.
According to one or more embodiments of the present disclosure, [Example 8] provides the method of any one of Examples 1 to 7, further including:
acquiring a first predicted starting probability with the highest value among a plurality of predicted starting probabilities, and judging whether a first character corresponding to the first predicted starting probability is a word character;
if the first character is judged to be a word character, taking the first character as the target starting character; or
if the first character is judged not to be a word character, acquiring a second predicted starting probability with the highest value among the remaining predicted starting probabilities excluding the first predicted starting probability, and judging whether a second character corresponding to the second predicted starting probability is a word character, and so on, until the target character corresponding to the highest remaining predicted starting probability is a word character, and taking the target character as the target starting character.
According to one or more embodiments of the present disclosure, [Example 9] provides the method of Example 8, further including:
taking the word character that is located after the first character and closest to the first character as the target starting character.
According to one or more embodiments of the present disclosure, [Example 10] provides the method of any one of Examples 1 to 7, further including:
if the first predicted starting probability with the highest value among the plurality of predicted starting probabilities corresponds to multiple candidate starting characters, selecting, among the multiple candidate starting characters, the target candidate starting character that is first in character order as the target starting character;
if the first predicted ending probability with the highest value among the plurality of predicted ending probabilities corresponds to multiple candidate ending characters, selecting, among the multiple candidate ending characters, the target candidate ending character that is last in character order as the target ending character.
According to one or more embodiments of the present disclosure, [Example 11] provides the method of any one of Examples 1 to 7, further including:
acquiring the first predicted starting probability with the highest value and the second predicted starting probability with the second-highest value among the plurality of predicted starting probabilities, and the first predicted ending probability with the highest value and the second predicted ending probability with the second-highest value among the plurality of predicted ending probabilities, and acquiring a first character, a second character, a third character, and a fourth character corresponding respectively to the first predicted starting probability, the second predicted starting probability, the first predicted ending probability, and the second predicted ending probability;
determining the character order of the first character, the second character, the third character, and the fourth character;
if the character order is the first character, the third character, the second character, the fourth character, taking the first character and the third character as a first target starting character and a first target ending character, respectively, and taking the second character and the fourth character as a second target starting character and a second target ending character, respectively;
extracting first valid information from the text information according to the first target starting character and the first target ending character, and extracting second valid information from the text information according to the second target starting character and the second target ending character.
According to one or more embodiments of the present disclosure, [Example 12] provides the method of Example 7, further including:
the initial sequence extraction model is built on a self-attention mechanism.
According to one or more embodiments of the present disclosure, [Example 13] provides an information processing apparatus, including:
a prediction probability acquisition module, configured to acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point;
a target character acquisition module, configured to determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters;
a first valid information acquisition module, configured to extract valid information from the text information according to the target starting character and the target ending character.
According to one or more embodiments of the present disclosure, [Example 14] provides the apparatus of Example 13, further including:
the prediction probability acquisition module is configured to acquire, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
According to one or more embodiments of the present disclosure, [Example 15] provides the apparatus of Example 13, further including:
a classification probability acquisition module, configured to acquire the classification probability that the text information is text containing invalid information;
the prediction probability acquisition module is configured to acquire, through the trained sequence extraction model, if the classification probability is greater than or equal to a preset probability threshold, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
According to one or more embodiments of the present disclosure, [Example 16] provides the apparatus of Example 15, further including:
the classification probability acquisition module is configured to acquire, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
According to one or more embodiments of the present disclosure, [Example 17] provides the apparatus of Example 15, further including:
a second valid information acquisition module, configured to take the text information as valid information if the classification probability is less than the preset probability threshold.
According to one or more embodiments of the present disclosure, [Example 18] provides the apparatus of Example 16, further including:
a semantic classification model acquisition module, configured to acquire a semantic sample set, and perform semantic classification training on an initial semantic classification model using the semantic sample set, to obtain the trained semantic classification model; where the initial semantic classification model is built on a neural network.
According to one or more embodiments of the present disclosure, [Example 19] provides the apparatus of Example 14, further including:
a sequence extraction model acquisition module, configured to acquire a sequence sample set, and perform sequence extraction training on an initial sequence extraction model using the sequence sample set, to obtain the trained sequence extraction model.
According to one or more embodiments of the present disclosure, [Example 20] provides the apparatus of any one of Examples 13 to 19, where the target character acquisition module includes:
a first predicted starting probability acquisition unit, configured to acquire a first predicted starting probability with the highest value among a plurality of predicted starting probabilities, and judge whether a first character corresponding to the first predicted starting probability is a word character;
a first target starting character acquisition unit, configured to take the first character as the target starting character if the first character is judged to be a word character; or, if the first character is judged not to be a word character, acquire a second predicted starting probability with the highest value among the remaining predicted starting probabilities excluding the first predicted starting probability, and judge whether a second character corresponding to the second predicted starting probability is a word character, and so on, until the target character corresponding to the highest remaining predicted starting probability is a word character, and take the target character as the target starting character.
According to one or more embodiments of the present disclosure, [Example 21] provides the apparatus of Example 20, where the target character acquisition module further includes:
a second target starting character acquisition unit, configured to take the word character that is located after the first character and closest to the first character as the target starting character.
According to one or more embodiments of the present disclosure, [Example 22] provides the apparatus of any one of Examples 13 to 19, where the target character acquisition module includes:
a target starting character acquisition module, configured to, if the first predicted starting probability with the highest value among the plurality of predicted starting probabilities corresponds to multiple candidate starting characters, select, among the multiple candidate starting characters, the target candidate starting character that is first in character order as the target starting character;
a target ending character acquisition module, configured to, if the first predicted ending probability with the highest value among the plurality of predicted ending probabilities corresponds to multiple candidate ending characters, select, among the multiple candidate ending characters, the target candidate ending character that is last in character order as the target ending character.
According to one or more embodiments of the present disclosure, [Example 23] provides the apparatus of any one of Examples 13 to 19, where the target character acquisition module further includes:
a character extraction unit, configured to acquire the first predicted starting probability with the highest value and the second predicted starting probability with the second-highest value among the plurality of predicted starting probabilities, and the first predicted ending probability with the highest value and the second predicted ending probability with the second-highest value among the plurality of predicted ending probabilities, and acquire a first character, a second character, a third character, and a fourth character corresponding respectively to the first predicted starting probability, the second predicted starting probability, the first predicted ending probability, and the second predicted ending probability;
a character ordering execution unit, configured to determine the character order of the first character, the second character, the third character, and the fourth character;
a target character extraction unit, configured to, if the character order is the first character, the third character, the second character, the fourth character, take the first character and the third character as a first target starting character and a first target ending character, respectively, and take the second character and the fourth character as a second target starting character and a second target ending character, respectively;
the first valid information acquisition module is configured to extract first valid information from the text information according to the first target starting character and the first target ending character, and extract second valid information from the text information according to the second target starting character and the second target ending character.
According to one or more embodiments of the present disclosure, [Example 24] provides the apparatus of Example 19, further including:
the initial sequence extraction model is built on a self-attention mechanism.
According to one or more embodiments of the present disclosure, [Example 25] provides an electronic device, including a memory, a processing apparatus, and a computer program stored in the memory and executable on the processing apparatus, where the processing apparatus, when executing the program, implements the information processing method of any one of Examples 1-12.
According to one or more embodiments of the present disclosure, [Example 26] provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the information processing method of any one of Examples 1-12.
In addition, although multiple operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although the above discussion contains several implementation details, these should not be construed as limiting the scope of the present disclosure. Some features that are described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.

Claims (15)

  1. An information processing method, comprising:
    acquiring text information to be recognized, and acquiring, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point;
    determining a target starting character according to the predicted starting probabilities of the plurality of characters, and determining a target ending character according to the predicted ending probabilities of the plurality of characters;
    extracting valid information from the text information according to the target starting character and the target ending character.
  2. The method according to claim 1, wherein acquiring, for each of the plurality of characters in the text information, the predicted starting probability of the character serving as an extraction starting point and the predicted ending probability of the character serving as an extraction end point comprises:
    acquiring, through a trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
  3. The method according to claim 1, wherein before acquiring, for each of the plurality of characters in the text information, the predicted starting probability of the character serving as an extraction starting point and the predicted ending probability of the character serving as an extraction end point, the method further comprises:
    acquiring the classification probability that the text information is text containing invalid information;
    acquiring, for each of the plurality of characters in the text information, the predicted starting probability of the character serving as an extraction starting point and the predicted ending probability of the character serving as an extraction end point comprises:
    in a case where the classification probability is greater than or equal to a preset classification probability threshold, acquiring the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point.
  4. The method according to claim 3, wherein acquiring the classification probability that the text information is text containing invalid information comprises:
    acquiring, through a trained semantic classification model, the classification probability that the text information is text containing invalid information.
  5. The method according to claim 3, wherein after acquiring the classification probability that the text information is text containing invalid information, the method further comprises:
    in a case where the classification probability is less than the preset classification probability threshold, taking the text information as valid information.
  6. The method according to claim 4, wherein before acquiring, through the trained semantic classification model, the classification probability that the text information is text containing invalid information, the method further comprises:
    acquiring a semantic sample set, and performing semantic classification training on an initial semantic classification model using the semantic sample set, to obtain the trained semantic classification model; wherein the initial semantic classification model is built on a neural network.
  7. The method according to claim 2, wherein before acquiring, through the trained sequence extraction model, the predicted starting probability of each character in the text information serving as an extraction starting point and the predicted ending probability of each character serving as an extraction end point, the method further comprises:
    acquiring a sequence sample set, and performing sequence extraction training on an initial sequence extraction model using the sequence sample set, to obtain the trained sequence extraction model.
  8. The method according to any one of claims 1-7, wherein determining the target starting character according to the predicted starting probabilities of the plurality of characters comprises:
    acquiring a first predicted starting probability with the highest value among a plurality of predicted starting probabilities, and judging whether a first character corresponding to the first predicted starting probability is a word character;
    in a case where the first character is judged to be a word character, taking the first character as the target starting character; or
    in a case where the first character is judged not to be a word character, acquiring a second predicted starting probability with the highest value among the remaining predicted starting probabilities excluding the first predicted starting probability, and judging whether a second character corresponding to the second predicted starting probability is a word character, and so on, until the target character corresponding to the highest remaining predicted starting probability is a word character, and taking the target character as the target starting character.
  9. The method according to claim 8, wherein after judging that the first character is not a word character, the method further comprises:
    taking the word character that is located after the first character and closest to the first character as the target starting character.
  10. The method according to any one of claims 1-7, wherein determining the target starting character according to the predicted starting probabilities of the plurality of characters, and determining the target ending character according to the predicted ending probabilities of the plurality of characters, comprises:
    in a case where the first predicted starting probability with the highest value among the plurality of predicted starting probabilities corresponds to multiple candidate starting characters, selecting, among the multiple candidate starting characters, the target candidate starting character that is first in character order as the target starting character;
    in a case where the first predicted ending probability with the highest value among the plurality of predicted ending probabilities corresponds to multiple candidate ending characters, selecting, among the multiple candidate ending characters, the target candidate ending character that is last in character order as the target ending character.
  11. The method according to any one of claims 1-7, wherein determining the target starting character according to the predicted starting probabilities of the plurality of characters, and determining the target ending character according to the predicted ending probabilities of the plurality of characters, comprises:
    acquiring the first predicted starting probability with the highest value and the second predicted starting probability with the second-highest value among the plurality of predicted starting probabilities, and the first predicted ending probability with the highest value and the second predicted ending probability with the second-highest value among the plurality of predicted ending probabilities, and acquiring a first character, a second character, a third character, and a fourth character corresponding respectively to the first predicted starting probability, the second predicted starting probability, the first predicted ending probability, and the second predicted ending probability;
    determining the character order of the first character, the second character, the third character, and the fourth character;
    in a case where the character order is the first character, the third character, the second character, the fourth character, taking the first character and the third character as a first target starting character and a first target ending character, respectively, and taking the second character and the fourth character as a second target starting character and a second target ending character, respectively;
    extracting valid information from the text information according to the target starting character and the target ending character comprises:
    extracting first valid information from the text information according to the first target starting character and the first target ending character, and extracting second valid information from the text information according to the second target starting character and the second target ending character.
  12. The method according to claim 7, wherein the initial sequence extraction model is built on a self-attention mechanism.
  13. An information processing apparatus, comprising:
    a prediction probability acquisition module, configured to acquire text information to be recognized, and acquire, for each of a plurality of characters in the text information, a predicted starting probability of the character serving as an extraction starting point and a predicted ending probability of the character serving as an extraction end point;
    a target character acquisition module, configured to determine a target starting character according to the predicted starting probabilities of the plurality of characters, and determine a target ending character according to the predicted ending probabilities of the plurality of characters;
    a first valid information acquisition module, configured to extract valid information from the text information according to the target starting character and the target ending character.
  14. An electronic device, comprising a memory, a processing apparatus, and a computer program stored in the memory and executable on the processing apparatus, wherein the processing apparatus, when executing the computer program, implements the information processing method according to any one of claims 1-12.
  15. A storage medium containing computer-executable instructions, wherein the computer-executable instructions, when executed by a computer processor, are used to perform the information processing method according to any one of claims 1-12.
PCT/CN2021/131092 2020-11-24 2021-11-17 Information processing method, apparatus, electronic device, and storage medium WO2022111347A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011330581.7 2020-11-24
CN202011330581.7A CN112434510B (zh) 2020-11-24 2020-11-24 Information processing method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022111347A1 true WO2022111347A1 (zh) 2022-06-02

Family

ID=74692945

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131092 WO2022111347A1 (zh) 2020-11-24 2021-11-17 信息处理方法、装置、电子设备和存储介质

Country Status (2)

Country Link
CN (1) CN112434510B (zh)
WO (1) WO2022111347A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434510B (zh) * 2020-11-24 2024-03-29 北京字节跳动网络技术有限公司 Information processing method, apparatus, electronic device, and storage medium
CN113392638A (zh) * 2021-06-11 2021-09-14 北京世纪好未来教育科技有限公司 Text evaluation method, apparatus, device, and medium
CN113836905B (zh) * 2021-09-24 2023-08-08 网易(杭州)网络有限公司 Topic extraction method, apparatus, terminal, and storage medium
CN113641799B (zh) * 2021-10-13 2022-02-11 腾讯科技(深圳)有限公司 Text processing method, apparatus, computer device, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020016796A1 (en) * 2000-06-23 2002-02-07 Hurst Matthew F. Document processing method, system and medium
WO2014003138A1 (ja) * 2012-06-29 2014-01-03 Kddi株式会社 Input character estimation device and program
CN110175273A (zh) * 2019-05-22 2019-08-27 腾讯科技(深圳)有限公司 Text processing method, apparatus, computer-readable storage medium, and computer device
CN110674271A (zh) * 2019-08-27 2020-01-10 腾讯科技(深圳)有限公司 Question answering processing method and apparatus
CN111241832A (zh) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method, apparatus, and electronic device
CN111914559A (zh) * 2020-07-31 2020-11-10 平安科技(深圳)有限公司 Text attribute extraction method and apparatus based on a probabilistic graphical model, and computer device
CN112434510A (zh) * 2020-11-24 2021-03-02 北京字节跳动网络技术有限公司 Information processing method, apparatus, electronic device, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685279B2 (en) * 2016-09-26 2020-06-16 Splunk Inc. Automatically generating field extraction recommendations
CN110162594B (zh) * 2019-01-04 2022-12-27 腾讯科技(深圳)有限公司 Opinion generation method and apparatus for text data, and electronic device
CN110110715A (zh) * 2019-04-30 2019-08-09 北京金山云网络技术有限公司 Text detection model training method, and text region and content determination method and apparatus
CN110598213A (zh) * 2019-09-06 2019-12-20 腾讯科技(深圳)有限公司 Keyword extraction method, apparatus, device, and storage medium
CN111160032B (zh) * 2019-12-17 2023-03-17 浙江大华技术股份有限公司 Named entity extraction method, apparatus, electronic device, and storage medium
CN111639234B (zh) * 2020-05-29 2023-06-27 北京百度网讯科技有限公司 Method and apparatus for mining focus points of core entities
CN111832287B (zh) * 2020-07-22 2024-04-19 广东工业大学 Joint entity-relation extraction method and apparatus
CN111914825B (zh) * 2020-08-03 2023-10-27 腾讯科技(深圳)有限公司 Character recognition method, apparatus, and electronic device

Also Published As

Publication number Publication date
CN112434510B (zh) 2024-03-29
CN112434510A (zh) 2021-03-02

Similar Documents

Publication Publication Date Title
WO2022111347A1 (zh) Information processing method, apparatus, electronic device, and storage medium
CN109543058B (zh) Method for detecting images, electronic device, and computer-readable medium
CN111368559A (zh) Speech translation method, apparatus, electronic device, and storage medium
CN111382261B (zh) Abstract generation method, apparatus, electronic device, and storage medium
WO2021135319A1 (zh) Deep-learning-based copywriting generation method, apparatus, and electronic device
CN111274368B (zh) Slot filling method and apparatus
CN111563390B (zh) Text generation method, apparatus, and electronic device
KR20210091076A (ko) Method, apparatus, electronic device, medium, and computer program for processing video
CN111368560A (zh) Text translation method, apparatus, electronic device, and storage medium
CN111883117A (zh) Voice wake-up method and apparatus
WO2022161122A1 (zh) Method, apparatus, device, and medium for processing meeting minutes
CN111400454A (zh) Abstract generation method, apparatus, electronic device, and storage medium
WO2022017299A1 (zh) Text detection method, apparatus, electronic device, and storage medium
US20190213646A1 (en) Information display program, data transmission program, data-transmitting apparatus, method for transmitting data, information-providing apparatus, and method for providing information
CN114338586A (zh) Message pushing method, apparatus, electronic device, and storage medium
CN117149140B (zh) Architecture information generation method and apparatus for coding, and related device
CN112069786A (zh) Text information processing method, apparatus, electronic device, and medium
CN111555960A (zh) Information generation method
WO2022174804A1 (zh) Text simplification method, apparatus, device, and storage medium
CN110750994A (zh) Entity relation extraction method, apparatus, electronic device, and storage medium
WO2023000782A1 (zh) Method, apparatus, readable medium, and electronic device for acquiring video hotspots
WO2022121859A1 (zh) Spoken language information processing method, apparatus, and electronic device
CN115620726A (zh) Speech text generation method, and training method and apparatus for speech text generation model
CN115292487A (zh) Naive-Bayes-based text classification method, apparatus, device, and medium
CN110502630B (zh) Information processing method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896848

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12.09.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21896848

Country of ref document: EP

Kind code of ref document: A1