CN112749256A - Text processing method, device, equipment and storage medium

Text processing method, device, equipment and storage medium

Info

Publication number
CN112749256A
CN112749256A (application CN202011643798.3A)
Authority
CN
China
Prior art keywords
text information
feature vector
text
calculating
time warping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011643798.3A
Other languages
Chinese (zh)
Inventor
任亮
傅雨梅
黄新涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhiyin Intelligent Technology Co ltd
Original Assignee
Beijing Zhiyin Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhiyin Intelligent Technology Co ltd filed Critical Beijing Zhiyin Intelligent Technology Co ltd
Priority to CN202011643798.3A
Publication of CN112749256A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061 Physical realisation using biological neurons, e.g. biological neurons connected to an integrated circuit
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Databases & Information Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text processing method, apparatus, device, and storage medium, wherein the method comprises the following steps: acquiring first text information and second text information to be processed; analyzing the first text information and the second text information respectively to obtain a first feature vector of the first text information and a second feature vector of the second text information; calculating a time warping distance between the first feature vector and the second feature vector; and calculating similarity information between the first text information and the second text information according to the time warping distance. The method and apparatus distinguish text information containing ambiguous words more accurately.

Description

Text processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a text processing method, apparatus, device, and storage medium.
Background
With the popularization and development of mobile intelligent terminal devices and social networks, a great deal of short text data (text with a character length under 200) such as news summaries, microblog messages, and product reviews has emerged, and how to mine commercially valuable information from this massive short text has become a focus of many Chinese natural language processing researchers. Chinese has a very large number of users, an abundant vocabulary, and flexible, varied ways of expressing meaning; for example, comparing the similarity of news summaries and clustering them can extract hot topics or retrieval keywords, helping users quickly learn of important news. Because short texts have few characters and sparse content yet carry rich semantic information in varied forms of expression, they play a large role in artificial intelligence fields such as machine translation, sentiment analysis, and information retrieval.
In the special short-text scenario of ambiguous words, a static word-vector model expresses each word in a single way, has difficulty combining dynamically with context, and cannot effectively express two or more kinds of word feature information through a low-dimensional dense word vector. For example, consider "I bought a bag of millet at the supermarket" and "Lei Jun launched the Xiaomi phone in Beijing": the Chinese word 小米 appears in both short texts but expresses different meanings; judging from the context of each, it denotes the grain millet in the first short text and the Xiaomi smartphone in the second. Because the same word with different senses appears in the texts, the two short texts are difficult to distinguish if the feature information the word expresses in its current context cannot be mined.
Disclosure of Invention
The embodiment of the application provides a text processing method which is used for accurately distinguishing text information with lexical ambiguity.
The embodiment of the application provides a text processing method, which comprises the following steps:
acquiring first text information and second text information to be processed;
respectively analyzing the first text information and the second text information to obtain a first feature vector of the first text information and a second feature vector of the second text information;
calculating a time warping distance between the first feature vector and the second feature vector;
and calculating the similarity information between the first text information and the second text information according to the time warping distance.
In an embodiment, the calculating a time warping distance between the first feature vector and the second feature vector comprises:
calculating the time warping distance between the first feature vector and the second feature vector by dynamic time warping (DTW), solved by dynamic programming.
In an embodiment, the calculating the time warping distance between the first feature vector and the second feature vector by dynamic time warping (DTW) includes: when the time warping distance is calculated, calculating a projection matrix of the first feature vector and a projection matrix of the second feature vector by canonical correlation analysis (CCA); wherein the projection matrix of the first feature vector and the projection matrix of the second feature vector are used to calculate the time warping distance.
In an embodiment, the calculating the similarity information between the first text information and the second text information according to the time warping distance includes:
calculating the similarity information by the following formula:
[formula not reproduced: the original is an image defining Sim(s1, s2) in terms of ctw(s1, s2)]
wherein s1 is the first text information, s2 is the second text information, ctw(s1, s2) represents the time warping distance between the first text information s1 and the second text information s2, and Sim(s1, s2) is the final similarity information.
In an embodiment, the analyzing the first text information and the second text information respectively to obtain a first feature vector of the first text information and a second feature vector of the second text information includes:
performing word segmentation on the first text information to obtain a first keyword set, and performing word segmentation on the second text information to obtain a second keyword set;
and respectively inputting the first keyword set into a preset feature recognition model, outputting the first feature vector, inputting the second keyword set into the preset feature recognition model, and outputting the second feature vector.
In one embodiment, the step of establishing the preset feature recognition model includes:
obtaining a sample corpus, wherein the sample corpus is labeled with text words and syntactic structural features;
and training a bidirectional encoder representation model with the sample corpus to obtain the preset feature recognition model.
In one embodiment, the bidirectional encoder representation model has 768 hidden-layer neurons.
A second aspect of the embodiments of the present application provides a text information processing apparatus, including:
an acquisition module configured to acquire first text information and second text information to be processed;
the analysis module is used for respectively analyzing the first text information and the second text information to obtain a first feature vector of the first text information and a second feature vector of the second text information;
a first calculation module for calculating a time warping distance between the first feature vector and the second feature vector;
and the second calculation module is used for calculating the similarity information between the first text information and the second text information according to the time warping distance.
A third aspect of embodiments of the present application provides an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the text processing method described above.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program executable by a processor to perform the text processing method described above.
According to the technical scheme provided by the embodiments of the application, a feature vector of each piece of text information is obtained by analyzing the pieces of text information; the time warping distance between each two pieces of text information is then calculated from the obtained feature vectors, and the similarity information between each two pieces of text information is calculated from the time warping distance, so that text information with lexical ambiguity is distinguished more accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those of ordinary skill in the art may derive other related drawings from these drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a text processing method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating the sub-steps of step 220 according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating steps of establishing a predetermined feature recognition model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a text information processing apparatus according to an embodiment of the present application.
Reference numerals:
100-an electronic device; 110-a bus; 120-a processor; 130-a memory; 500-a text information processing apparatus; 510-an obtaining module; 520-a resolution module; 530-a first calculation module; 540-second calculation module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
In the description of the present application, the terms "first," "second," and the like are used for distinguishing between descriptions and do not denote an order of magnitude, nor are they to be construed as indicating or implying relative importance.
In the description of the present application, the terms "comprises," "comprising," and/or the like, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
In the description of the present application, the terms "mounted," "disposed," "provided," "connected," and "configured" are to be construed broadly unless expressly stated or limited otherwise. For example, it may be a fixed connection, a removable connection, or a unitary construction; can be mechanically or electrically connected; either directly or indirectly through intervening media, or may be internal to two devices, elements or components. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Please refer to fig. 1, which is a schematic structural diagram of an electronic device 100 according to an embodiment of the present application. The electronic device 100 includes at least one processor 120 and a memory 130; fig. 1 takes one processor as an example. The processor 120 and the memory 130 are coupled by a bus 110, and the memory 130 stores instructions executable by the at least one processor 120; the instructions are executed by the at least one processor 120 to cause the at least one processor 120 to perform the text processing method of the embodiments described below.
In one embodiment, the processor 120 may be a general-purpose processor, including but not limited to a Central Processing Unit (CPU) or a Network Processor (NP), or may be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The general-purpose processor may be a microprocessor or any conventional processor. The processor 120 is the control center of the electronic device 100 and connects the various parts of the entire electronic device 100 using various interfaces and lines. The processor 120 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application.
In one embodiment, the memory 130 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, including but not limited to Random Access Memory (RAM), Read-Only Memory (ROM), Static Random Access Memory (SRAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
In one embodiment, the electronic device 100 may also communicate with one or more external devices, such as a keyboard, a mouse, a bluetooth device, a pointing device, etc., to enable a user to interact with the electronic device 100.
The structure of the electronic device 100 shown in fig. 1 is merely illustrative, and the electronic device 100 may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
As shown in fig. 2, which is a flowchart illustrating a text processing method according to an embodiment of the present application, the method may be executed by the electronic device 100 shown in fig. 1 to accurately distinguish text information with lexical ambiguity. The method comprises the following steps:
step 210: and acquiring first text information and second text information to be processed.
In the above steps, the first text information and the second text information include, but are not limited to: short text information such as news abstracts, microblog messages, commodity comments and the like.
Step 220: and respectively analyzing the first text information and the second text information to obtain a first feature vector of the first text information and a second feature vector of the second text information.
Step 230: a time warping distance between the first feature vector and the second feature vector is calculated.
The above step includes: calculating the time warping distance between the first feature vector and the second feature vector by Dynamic Time Warping (DTW), which is solved by dynamic programming; and, when calculating the time warping distance, calculating a projection matrix of the first feature vector and a projection matrix of the second feature vector by Canonical Correlation Analysis (CCA), the two projection matrices being used in calculating the time warping distance.
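As an illustrative sketch only (not part of the patent disclosure), the DTW recursion solved by dynamic programming can be written as follows; the function name, the NumPy dependency, the per-frame Euclidean cost, and the path backtracking are assumptions made for illustration. The returned path is reused in the CTW sketch later in this section.

```python
import numpy as np

def dtw_path(A, B):
    """Dynamic time warping by dynamic programming.

    A: (n, b) and B: (m, b) feature-vector sequences.
    Returns (accumulated alignment cost, list of aligned index pairs).
    """
    n, m = len(A), len(B)
    # pairwise Euclidean cost between every frame of A and every frame of B
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # classic DTW recursion: cheapest of match / insertion / deletion
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack the optimal warping path
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]
```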
In one implementation, the computation proceeds as follows. First step: input the first feature vector sequence X and the second feature vector sequence Y. Second step: initialize Vx = Idx and Vy = Idy, where Vx is the projection matrix of the first feature vector X, Vy is the projection matrix of the second feature vector Y, and Idx, Idy are identity matrices. Third step: compute the warping matrices Wx and Wy by dynamic programming, and execute this step in a loop; Wx and Wy are binary selection matrices that align sequence X with sequence Y and encode the alignment path. Fourth step: to align the two sequences, update the projection matrices Vx and Vy by CCA on the aligned samples XWx and YWy; their columns are the generalized eigenvectors of the CCA generalized eigenvalue problem, with b the dimension of the common subspace introduced by the projections. The loop continues until the time warping distance reaches its minimum, at which point the calculation ends and the minimum is output. The objective Jctw is calculated as:
Jctw(Wx, Wy, Vx, Vy) = || Vx' X Wx - Vy' Y Wy ||F^2,
where ' denotes the matrix transpose and || · ||F is the Frobenius norm.
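The alternation above can be sketched compactly as follows, reusing dtw_path from the earlier sketch. This is an illustration under stated assumptions, not the patent's reference implementation: sequences are stored row-wise (tokens by dimensions), a small ridge term keeps the CCA covariance matrices invertible, the y-side projections are recovered only up to per-column scale, and all helper names, the shared-subspace dimension b, and the fixed iteration count are illustrative.

```python
import numpy as np
# dtw_path is the DTW sketch shown earlier in this section

def cca_projections(Xa, Ya, b, reg=1e-6):
    """CCA on aligned samples Xa (t, dx), Ya (t, dy); returns Vx (dx, b), Vy (dy, b)."""
    Xa = Xa - Xa.mean(axis=0)
    Ya = Ya - Ya.mean(axis=0)
    Cxx = Xa.T @ Xa + reg * np.eye(Xa.shape[1])
    Cyy = Ya.T @ Ya + reg * np.eye(Ya.shape[1])
    Cxy = Xa.T @ Ya
    # top eigenvectors of Cxx^{-1} Cxy Cyy^{-1} Cyx give the x-side projections
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:b]
    Vx = vecs[:, top].real
    Vy = np.linalg.solve(Cyy, Cxy.T) @ Vx  # y-side projections, up to per-column scale
    return Vx, Vy

def ctw_distance(X, Y, b=2, iters=10):
    """Alternate DTW alignment (Wx, Wy) and CCA projection (Vx, Vy)."""
    Vx = np.eye(X.shape[1])[:, :b]  # initialized from the identity, as in the text
    Vy = np.eye(Y.shape[1])[:, :b]
    dist = np.inf
    for _ in range(iters):
        dist, path = dtw_path(X @ Vx, Y @ Vy)      # alignment step (dynamic programming)
        ix, iy = map(list, zip(*path))
        Vx, Vy = cca_projections(X[ix], Y[iy], b)  # projection step (CCA)
    return dist

# Usage: X, Y are (num_tokens, 768) feature sequences for the two texts.
# d = ctw_distance(X, Y, b=2)
```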
Step 240: calculate the similarity information between the first text information and the second text information according to the time warping distance.
In the above step, the similarity information is calculated by the following formula:
[formula not reproduced: the original is an image defining Sim(s1, s2) in terms of ctw(s1, s2)]
where s1 is the first text information, s2 is the second text information, ctw(s1, s2) is the time warping distance between the first text information s1 and the second text information s2, and Sim(s1, s2) is the final similarity information.
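Because the patent's similarity formula survives only as an image reference, the sketch below uses 1 / (1 + d), a common way to map a non-negative distance into a similarity in (0, 1]; the exact functional form used in the patent is an assumption here.

```python
def similarity(ctw_dist: float) -> float:
    """Map a non-negative time warping distance to a similarity score.

    The patent's formula is an image that did not survive extraction;
    1 / (1 + d) is an assumed stand-in with the right qualitative
    behaviour: distance 0 gives similarity 1, and similarity decreases
    monotonically as the distance grows.
    """
    return 1.0 / (1.0 + ctw_dist)
```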
As shown in fig. 3, which is a schematic flowchart of the sub-steps of step 220 according to an embodiment of the present application, step 220 (analyzing the first text information and the second text information respectively to obtain a first feature vector of the first text information and a second feature vector of the second text information) may include:
step 221: and segmenting the first text information to obtain a first keyword set, and segmenting the second text information to obtain a second keyword set.
In the above steps, the model vocabulary may be loaded, a word segmentation device may be constructed, and then the word segmentation device may be used to perform word segmentation operation on the first text information and the second text information, so as to obtain a first keyword set and a second keyword set. And finally, performing part-of-speech tagging on the first keyword set and the second keyword set after the word segmentation operation by using a word segmentation device.
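The patent does not name a particular segmenter; as one concrete possibility, the jieba library performs the segmentation and part-of-speech tagging described above in a single pass. The library choice and the keyword filter below are assumptions.

```python
import jieba.posseg as pseg  # jieba: a widely used Chinese segmenter (assumed choice)

text = "我在超市买了一袋小米"  # "I bought a bag of millet (xiaomi) at the supermarket"
tagged = pseg.lcut(text)      # segmentation and part-of-speech tagging in one pass
print([(w.word, w.flag) for w in tagged])

# keep, e.g., nouns and verbs as the keyword set (an illustrative filter)
keywords = [w.word for w in tagged if w.flag.startswith(("n", "v"))]
```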
Step 222: input the first keyword set into a preset feature recognition model to output the first feature vector, and input the second keyword set into the same feature recognition model to output the second feature vector.
In the above step, each keyword is converted into a one-dimensional vector by looking it up in the word-vector table, and the first feature vector of the first text information and the second feature vector of the second text information are output; the first feature vector and the second feature vector are vector representations of each keyword that incorporate the semantic information of the full text.
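A sketch of obtaining per-keyword, context-aware vectors with a pretrained Chinese BERT through the Hugging Face transformers library follows; the checkpoint name bert-base-chinese is an assumption, as the patent only specifies a BERT-style model whose hidden size, and hence output dimension, is 768.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def encode(text: str) -> torch.Tensor:
    """Return a (seq_len, 768) sequence of context-aware token vectors."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[0]  # one 768-dim vector per token

vecs1 = encode("我在超市买了一袋小米")      # "millet" sense of 小米
vecs2 = encode("雷军在北京发布了小米手机")  # "Xiaomi phone" sense of 小米
# vecs1 / vecs2 can feed the DTW/CTW distance sketched earlier in this section
```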
As shown in fig. 4, which is a schematic flow chart illustrating a step of establishing a preset feature recognition model in an embodiment of the present application, the step of establishing the preset feature recognition model includes:
step 310: and acquiring a sample corpus, wherein the sample corpus is labeled with text words and syntactic structure characteristics.
In the above steps, before the preset feature recognition model processes the first text information and the second text information, a large batch of text corpora needs to be used for pre-training the preset feature recognition model, and for relevant processing of a news short text, the news corpora is used for training, so that the effect is better.
Step 320: and training the two-way encoder characterization model by adopting the sample corpora to obtain a preset feature recognition model.
In the above steps, at least one word in the sample corpus is replaced with a word mask respectively to obtain a sample corpus including at least one word mask; then, inputting the sample corpus including at least one word mask code into a bidirectional encoder characterization model (BERT), and outputting a context vector of each word mask code in the at least one word mask code through the bidirectional encoder characterization model; determining a word vector corresponding to each word mask based on the context vector and the word vector parameter matrix of each word mask respectively; and finally, training the two-way encoder characterization model based on the word vector corresponding to each word mask until a preset training completion condition is met, and obtaining a preset feature recognition model. The number of neurons in a hidden layer of a bidirectional encoder characterization model is 768, so the dimensionality of a feature vector of each piece of text information output by processing is 768, a short text array is generally used as input in the processing process, and the output result is also a sequence of 768-dimensional feature vectors.
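The mask-and-predict training step described above can be sketched with BertForMaskedLM from the transformers library; the checkpoint, optimizer settings, masked position, and single-example loop are illustrative assumptions rather than the patent's training procedure.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

sample = "雷军在北京发布了小米手机"
inputs = tokenizer(sample, return_tensors="pt")
labels = inputs["input_ids"].clone()

# replace one token with [MASK]; the model must predict it from context
mask_pos = 3
inputs["input_ids"][0, mask_pos] = tokenizer.mask_token_id
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100  # loss only on the mask

outputs = model(**inputs, labels=labels)  # cross-entropy over the masked position
outputs.loss.backward()
optimizer.step()
```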
As shown in fig. 5, which is a schematic structural diagram of a text information processing apparatus 500 according to an embodiment of the present application, the apparatus can be applied to the electronic device 100 shown in fig. 1 and includes: an obtaining module 510, a parsing module 520, a first calculation module 530, and a second calculation module 540. The modules are related as follows:
the obtaining module 510 is configured to obtain first text information and second text information to be processed.
The parsing module 520 is configured to parse the first text information and the second text information respectively to obtain a first feature vector of the first text information and a second feature vector of the second text information.
A first calculating module 530, configured to calculate a time warping distance between the first feature vector and the second feature vector.
The second calculating module 540 is configured to calculate similarity information between the first text information and the second text information according to the time warping distance.
For a detailed description of the text information processing apparatus 500, please refer to the description of the related method steps in the above embodiments.
An embodiment of the present application further provides a storage medium readable by an electronic device, including: a program that, when run on the electronic device, causes the electronic device to perform all or part of the procedures of the methods in the above embodiments. The storage medium may be a magnetic disk, an optical disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like, and may also be a combination of the above kinds of memory.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of text processing, comprising:
acquiring first text information and second text information to be processed;
respectively analyzing the first text information and the second text information to obtain a first feature vector of the first text information and a second feature vector of the second text information;
calculating a time warping distance between the first feature vector and the second feature vector;
and calculating the similarity information between the first text information and the second text information according to the time warping distance.
2. The method of claim 1, wherein the calculating a time warping distance between the first feature vector and the second feature vector comprises:
calculating the time warping distance between the first feature vector and the second feature vector by dynamic time warping (DTW), solved by dynamic programming.
3. The method of claim 2, wherein the calculating the time warping distance between the first feature vector and the second feature vector by dynamic time warping (DTW) comprises:
when the time warping distance is calculated, calculating a projection matrix of the first feature vector and a projection matrix of the second feature vector by canonical correlation analysis (CCA); wherein the projection matrix of the first feature vector and the projection matrix of the second feature vector are used to calculate the time warping distance.
4. The method according to claim 1, wherein said calculating similarity information between the first text information and the second text information according to the time warping distance comprises:
calculating the similarity information by the following formula:
[formula not reproduced: the original is an image defining Sim(s1, s2) in terms of ctw(s1, s2)]
wherein s1 is the first text information, s2 is the second text information, ctw(s1, s2) represents the time warping distance between the first text information s1 and the second text information s2, and Sim(s1, s2) is the final similarity information.
5. The method of claim 1, wherein the parsing the first text message and the second text message respectively to obtain a first feature vector of the first text message and a second feature vector of the second text message comprises:
performing word segmentation on the first text information to obtain a first keyword set, and performing word segmentation on the second text information to obtain a second keyword set;
and respectively inputting the first keyword set into a preset feature recognition model, outputting the first feature vector, inputting the second keyword set into the preset feature recognition model, and outputting the second feature vector.
6. The method of claim 5, further comprising establishing the preset feature recognition model, the establishing comprising:
obtaining a sample corpus, wherein the sample corpus is labeled with text words and syntactic structural features;
and training a bidirectional encoder characterization model by using the sample corpus to obtain the preset feature recognition model.
7. The method of claim 6, wherein the number of hidden-layer neurons in the bidirectional encoder representation model is 768.
8. A text processing apparatus, comprising:
an acquisition module configured to acquire first text information and second text information to be processed;
the analysis module is used for respectively analyzing the first text information and the second text information to obtain a first feature vector of the first text information and a second feature vector of the second text information;
a first calculation module for calculating a time warping distance between the first feature vector and the second feature vector;
and the second calculation module is used for calculating the similarity information between the first text information and the second text information according to the time warping distance.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the text processing method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the text processing method of any one of claims 1-7.
Application CN202011643798.3A, filed 2020-12-30 (priority 2020-12-30): Text processing method, device, equipment and storage medium. Publication CN112749256A (en). Status: Pending.

Priority Applications (1)

Application CN202011643798.3A (CN112749256A), priority date 2020-12-30, filing date 2020-12-30: Text processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application CN202011643798.3A (CN112749256A), priority date 2020-12-30, filing date 2020-12-30: Text processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112749256A 2021-05-04

Family

ID=75649430

Family Applications (1)

Application CN202011643798.3A (publication CN112749256A), priority date 2020-12-30, filing date 2020-12-30: Text processing method, device, equipment and storage medium. Status: Pending.

Country Status (1)

Country: CN. Publication: CN112749256A (en).


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108768950A (en) * 2018-04-28 2018-11-06 山东亚华电子股份有限公司 A kind of medical communication account management method and system
CN108710613A (en) * 2018-05-22 2018-10-26 平安科技(深圳)有限公司 Acquisition methods, terminal device and the medium of text similarity
CN109858015A (en) * 2018-12-12 2019-06-07 湖北工业大学 A kind of semantic similarity calculation method and device based on CTW and KM algorithm
CN110442871A (en) * 2019-08-06 2019-11-12 北京百度网讯科技有限公司 Text message processing method, device and equipment
CN110909550A (en) * 2019-11-13 2020-03-24 北京环境特性研究所 Text processing method and device, electronic equipment and readable storage medium
CN111414746A (en) * 2020-04-10 2020-07-14 中国建设银行股份有限公司 Matching statement determination method, device, equipment and storage medium
CN111832301A (en) * 2020-07-28 2020-10-27 电子科技大学 Chinese word vector generation method based on adaptive component n-tuple

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292620A (en) * 2022-08-09 2022-11-04 腾讯科技(深圳)有限公司 Region information identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11544474B2 (en) Generation of text from structured data
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
WO2021135469A1 (en) Machine learning-based information extraction method, apparatus, computer device, and medium
US20220318275A1 (en) Search method, electronic device and storage medium
WO2020147409A1 (en) Text classification method and apparatus, computer device, and storage medium
US20230130006A1 (en) Method of processing video, method of quering video, and method of training model
CN111459977B (en) Conversion of natural language queries
CN111078842A (en) Method, device, server and storage medium for determining query result
CN111324771A (en) Video tag determination method and device, electronic equipment and storage medium
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN111783471A (en) Semantic recognition method, device, equipment and storage medium of natural language
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
CN112906368B (en) Industry text increment method, related device and computer program product
CN111444712B (en) Keyword extraction method, terminal and computer readable storage medium
CN112749256A (en) Text processing method, device, equipment and storage medium
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
US20230334075A1 (en) Search platform for unstructured interaction summaries
CN117194616A (en) Knowledge query method and device for vertical domain knowledge graph, computer equipment and storage medium
CN112347365A (en) Target search information determination method and device
CN113392630A (en) Semantic analysis-based Chinese sentence similarity calculation method and system
CN109189932B (en) Text classification method and device and computer-readable storage medium
CN112732913B (en) Method, device, equipment and storage medium for classifying unbalanced samples
CN113688268B (en) Picture information extraction method, device, computer equipment and storage medium
CN116756596B (en) Text clustering model training method, text clustering device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination