WO2023004528A1 - Parallelized named entity recognition method and device based on a distributed system - Google Patents

Parallelized named entity recognition method and device based on a distributed system

Info

Publication number
WO2023004528A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
hidden state
information
text block
entity recognition
Prior art date
Application number
PCT/CN2021/108313
Other languages
English (en)
French (fr)
Inventor
包先雨
马群凯
吴绍精
郭云
周长春
彭锦学
程立勋
郑文丽
蔡屹
Original Assignee
深圳市检验检疫科学研究院
全国海关信息中心(全国海关电子通关中心)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市检验检疫科学研究院, 全国海关信息中心(全国海关电子通关中心) filed Critical 深圳市检验检疫科学研究院
Priority to PCT/CN2021/108313 priority Critical patent/WO2023004528A1/zh
Publication of WO2023004528A1 publication Critical patent/WO2023004528A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • This application relates to the field of identity authentication, in particular to a distributed system-based parallel named entity recognition method and device.
  • Named entity recognition is a basic research in the field of natural language processing, and is closely related to applications such as entity information extraction, relationship extraction, syntax analysis, and knowledge graph construction.
  • the practicability of the named entity recognition model is the criterion for judging its quality. Its main function is to automatically identify the boundaries of entities in text sentences. It plays an important role in the fields of corporate annual reports, test reports, judicial documents, online shopping reviews, and medical guidelines.
  • the evaluation indicators focus on two aspects: accuracy and efficiency.
  • these two aspects often show a benefit trade-off, that is, when the accuracy is improved, the efficiency is reduced, and to a certain extent one rises while the other falls, making it difficult to improve the overall performance of the model. Therefore, it is very important to reduce the trade-off between accuracy and efficiency and to enhance practicability.
  • Huang et al. first applied BiLSTM (bidirectional LSTM) combined with CRF to named entity recognition.
  • Ma et al. learned from word-level and character-level representations by using a combination of bidirectional LSTM, CNN, and CRF, achieving end-to-end recognition in the true sense.
  • Ma Xiaofei added an attention mechanism layer between the LSTM layer and the CRF layer, so that the model can better mine the local features of the sequence, realizing named entity recognition for Chinese social media.
  • Cao et al. learned the dependency between any two characters in a sentence through a local attention mechanism to perform named entity recognition on Weibo text.
  • Li Mingyang et al. combined a self-attention mechanism to build an LSTM-SelfAtt-CRF model to perform named entity recognition on social media.
  • Compared with shallow learning, deep learning methods highlight the importance of feature learning, using big data for feature learning to obtain the intrinsic information of the data, and the performance of named entity recognition is better.
  • In this research direction, the BERT-BiGRU-Attention-CRF (BERT-BAC) model is representative, but it also raises a benefit trade-off problem, making model training and named entity recognition too time-consuming.
  • this application is proposed to provide a distributed system-based parallel named entity recognition method and device that overcomes the above problems or at least partially solves the above problems, including:
  • a method for parallelized named entity recognition based on a distributed system comprising:
  • the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and reverse hidden state of the text block at time t are obtained;
  • the feature information of the text block is weighted and summed through a fully connected layer, and an output feature vector is generated through an activation function softmax;
  • the best sequence is output as the entity recognition result.
  • the step of acquiring data text information and performing text division on the data text information according to named entity recognition rules to generate several text blocks includes:
  • the right boundary of the sliding window is used as a dividing line to divide the data text information, and the part of the data text information located on the left side of the dividing line after division is set as the text block.
  • a second preset value is also included, wherein the second preset value is smaller than the first preset value; when the current Hash value is smaller than the second preset value, the sliding window is driven to slide forward by a preset unit step.
  • if the sliding window has not reached the end of the data text information, the sliding window is driven to slide forward by a preset unit step.
  • the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and the reverse hidden state of the text block at time t are obtained.
  • the forward information and backward information of the text block are summarized and merged; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
  • the step of taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step includes:
  • the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector are used as training conditions to generate the feature information, wherein the feature information includes forward feature information and reverse feature information.
  • the step of weighting and summing the feature information of the text block through the fully connected layer and generating an output feature vector through the activation function softmax includes:
  • the output feature vector is generated from the summation result through the activation function softmax.
  • the step of inputting the feature information and random initial vectors into a pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector includes:
  • the hidden layer representation is obtained by the formula u_ij = tanh(W_α·S_ij + b_α), where i represents the i-th text block, j represents the j-th time step, u_ij represents the hidden layer representation output by S_ij after passing through the single-layer perceptron network, and S_ij represents S_i at the j-th time step.
  • a parallelized named entity recognition device based on a distributed system specifically comprising:
  • a block module configured to obtain data text information, and perform text division on the data text information according to named entity recognition rules to generate several text blocks;
  • the reverse hidden state acquisition module is used to input the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block, and obtain the forward hidden state and reverse hidden state of the text block at time t. hidden state;
  • the feature information generation module is used to take the forward hidden state and the reverse hidden state as the current time step, supplement the local features of the text block through the local attention mechanism, and output the feature information of the current time step;
  • the output feature vector generation module is used to weight and sum the feature information of the text block through the fully connected layer, and generate the output feature vector by the activation function softmax;
  • the entity recognition result generation module is used for the regularized maximum likelihood estimation of the CRF model, and outputs the best sequence as the entity recognition result.
  • several text blocks are generated by obtaining data text information and performing text division on the data text information according to named entity recognition rules; the word embedding vectors obtained by each text block through the embedding layer are input into the BiGRU neural network, the global features of the text block are mined, and the forward hidden state and reverse hidden state of the text block at time t are obtained; the forward hidden state and reverse hidden state are used as the current time step, and the local features of the text block are supplemented by a local attention mechanism.
  • Fig. 1 is a Block-BAC model schematic diagram of a parallel named entity recognition method based on a distributed system provided by an embodiment of the present application;
  • Fig. 2 is a MapReduce schematic diagram of a parallel named entity recognition method based on a distributed system provided by an embodiment of the present application;
  • FIG. 3 is a flow chart of a distributed system-based parallel named entity recognition method provided by an embodiment of the present application
  • FIG. 4 is a flow chart of a CDC-based data block optimization algorithm based on a distributed system-based parallel named entity recognition method provided by an embodiment of the present application;
  • Fig. 5 is a schematic diagram of a local attention mechanism of a parallel named entity recognition method based on a distributed system provided by an embodiment of the present application;
  • FIG. 6 is a schematic diagram of a model training runtime effect comparison of a parallelized named entity recognition method based on a distributed system provided by an embodiment of the present application;
  • Fig. 7 is a schematic diagram of a comparison of entity recognition running time effects of a parallel named entity recognition method based on a distributed system provided by an embodiment of the present application;
  • Fig. 8 is a structural block diagram of a parallel named entity recognition device based on a distributed system provided by an embodiment of the present application.
  • the method is based on the Hadoop architecture and includes five parts: text block preprocessing, BiGRU neural network, local attention mechanism, full connection and CRF, as shown in Figure 1.
  • BiGRU and CRF directly use the method used by the BERT-BAC model.
  • the text is preprocessed in blocks according to parallelization requirements. Then, in the Map parallel processing stage, the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network, the global features of the text block are mined, and the forward and reverse hidden states of the text block at time t are obtained. Then, with these hidden states as the current time step, the local attention mechanism is used to complement the local features of the text block, and the feature information h_t of the current time step is output. Finally, in the Reduce stage, the fully connected layer is set to perform a weighted summation of the feature information h_t of all text blocks, and softmax is connected to obtain the output feature vector. Through the regularized maximum likelihood estimation of the CRF model, the best sequence is output to obtain the entity recognition result.
  • the parallel processing based on Hadoop is not limited to each text block of a text of a certain device, and similarly can also realize the parallel processing of multiple texts of different devices, and integrate multiple entities recognized by different devices from multiple sources. In practical application, it can reduce human resources for operating equipment, shorten data processing time, and improve work efficiency.
  • HDFS is a distributed file system that provides an efficient way to manage data with high fault tolerance and across clusters
  • MapReduce is used for parallel processing of large-scale data sets.
  • This method uses MapReduce to achieve high-performance parallel computing, and uses HDFS to complete the underlying data storage.
  • MapReduce is mainly divided into the two stages of Map and Reduce.
  • First, the text is divided into several small data blocks, which are sent to specific nodes for Map-stage processing; the processing directly calls the parallel processing section of the model.
  • On each node, the Maps are processed at the same time; the parallelization of Hadoop is reflected here. Each Map is composed of the embedding layer, the BiGRU layer, and the local attention optimization mechanism layer in series.
  • The Reduce stage then sets the fully connected layer channel to combine the results output by the Maps, and finally the CRF layer is connected to output the recognition results.
  • An example of MapReduce is shown in Figure 2.
  • the data block is processed by Map, entity m of data block n is recognized, and the feature information h_t(n, m) is obtained; the outputs are then partitioned according to the recognized entity m output by the Map, with the partitioning controlled by a user-defined partition function; each Reduce task corresponds to one partition, and multiple Reduce stages are independently specified; the entity feature information of the partition corresponding to each Reduce is weighted, summed, and output.
  • the commonly used single-entity recognition requires only one Reduce task to directly merge and process the data output by the Map.
  • on the basis of BiGRU-CRF, the BERT-BAC model uses the BERT pre-trained language model to increase the information expression of the word vectors, then uses the global attention mechanism to fully mine the internal features of the text, and adopts the strategy of increasing hidden-layer nodes to improve the F1 value, which increased the F1 value by 8%.
  • the advantage of this method is that the bidirectional BERT pre-trained language model is used, and the information expression is better than that of feature-based and fine-tuning-based methods; representing words as vectors can make word vectors fit the context. Secondly, the global attention mechanism is used to learn the information between words in sentences and to selectively focus on important information, which can meet the need to accurately identify entities. But it also causes the opposite effect:
  • FIG. 3 shows a distributed system-based parallel named entity recognition method provided by an embodiment of the present application.
  • the method includes:
  • S310 Acquire data text information, and perform text division on the data text information according to named entity recognition rules to generate several text blocks;
  • several text blocks are generated by obtaining data text information and performing text division on the data text information according to named entity recognition rules; the word embedding vectors obtained by each text block through the embedding layer are input into the BiGRU neural network, the global features of the text block are mined, and the forward hidden state and reverse hidden state of the text block at time t are obtained; the forward hidden state and reverse hidden state are used as the current time step, and the local features of the text block are supplemented by a local attention mechanism.
  • the data text information is obtained, and the data text information is text-divided according to the named entity recognition rules to generate several text blocks;
  • step S310 the specific process of "obtaining data text information, and performing text division on the data text information to generate several text blocks" in step S310 can be further described in conjunction with the following description.
  • the right boundary of the sliding window is used as the dividing line to divide the data text information, and place the divided The text information portion of the data to the left of the line is set as the text block.
  • the value of the first preset value depends on the Hash algorithm, and setting according to the characteristics of the text will have a better effect, preferably 800B.
  • In an embodiment, the specific process of "obtaining data text information, and performing text division on the data text information to generate several text blocks" in step S310 can be further described in conjunction with the following description.
  • the sliding window is driven to slide forward by a preset unit step.
  • the value of the second preset value depends on the Hash algorithm, and setting according to the characteristics of the text will have a better effect, preferably 500B.
  • In an embodiment, the specific process of "obtaining data text information, and performing text division on the data text information to generate several text blocks" in step S310 can be further described in conjunction with the following description.
  • the right boundary of the sliding window is used as a dividing line to divide the data text information, and the part of the data text information located on the left side of the dividing line after division is set as the text block; if not, the sliding window is driven to slide forward by a preset unit step.
  • In an embodiment, the specific process of "obtaining data text information, and performing text division on the data text information to generate several text blocks" in step S310 can be further described in conjunction with the following description.
  • the sliding window is driven to slide forward by a preset unit step.
  • the content-defined variable-length chunking (CDC) algorithm is a strategy of applying Rabin fingerprints to divide text into chunks of different lengths. A fixed-size sliding window is used to divide the text: when the Rabin fingerprint value of the sliding window matches the expected value, a split point is placed at this position, and this process is repeated until the entire text has been divided, so that the text is divided into text blocks at the preset split points.
  • step 6: the calculated Hash value is compared with the Hash values obtained by the previous two calculations; if the Hash value is equal to the Hash values obtained by the previous two calculations, the position is divided according to step 7; if not, the window moves forward one unit according to step 2.
  • n text blocks are obtained, and the text blocks are sorted by size in bytes from largest to smallest.
  • when n < the number of Map nodes, the text blocks are sent in sorted order to specific nodes for Map processing, and the Map stage is entered taking the end of the Map parallel processing of text block 1, the largest in bytes, as the basis.
  • the algorithm divides the text into text blocks, which minimizes the probability of inaccurate output results due to semantic interruption caused by text block division, and increases the entity recognition effect of the model.
  • chunking can reduce the computation of the forgotten amount of past information in neural network learning between two adjacent paragraphs, and improve the entity recognition efficiency of the model.
  • the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and reverse hidden state of the text block at time t are obtained;
  • In an embodiment, the specific process of "inputting the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t" in step S320 can be further described in conjunction with the following description.
  • the forward information and backward information of the text block are summarized and merged; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
  • in step S330, the forward hidden state and the reverse hidden state are taken as the current time step, the local features of the text block are supplemented through the local attention mechanism, and the feature information of the current time step is output;
  • In an embodiment, the specific process of taking the forward hidden state and reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step in step S330 can be further described in conjunction with the following description.
  • the feature information is generated by using the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector as training conditions; wherein the feature information includes forward feature information and reverse feature information.
  • the BiGRU neural network can fully obtain features globally, but local features are insufficient.
  • taking the hidden state h_t output by the two unidirectional and opposite GRUs as the time step, the local attention mechanism is used to fully obtain the local features around this time step, obtain the internal structure information of the sentence, and reduce the computation of global attention, making the output result both accurate and fast.
  • the idea of the attention mechanism is that when people observe an image, they mostly focus on a specific part according to their own needs, so the scope of the traditional local attention mechanism is optimized, as shown in Fig. 5.
  • the vector α_t is the calculated correction vector
  • h_i is the hidden vector
  • h_t is the center of the window; it is made directly equal to the current time step t and is obtained through training, namely:
  • σ is a sigmoid function
  • v_p and W_p are trainable parameters. Therefore, the h_t calculated in this way is a floating-point number, but this has no effect, because when calculating the calibration weight vector α_t, an additional term with mean h_t is introduced.
  • the feature information of the text block is weighted and summed through the fully connected layer, and the output feature vector is generated through the activation function softmax;
  • In an embodiment, the specific process of "weighting and summing the feature information of the text block through the fully connected layer and generating the output feature vector through the activation function softmax" in step S340 can be further described in conjunction with the following description.
  • a weight vector is determined according to the hidden layer representation and the hidden state correlation vector of the upper and lower layers;
  • the feature information is weighted and summed according to the weight vector to obtain a summation result
  • the output feature vector is generated according to the summation result through the activation function softmax.
  • In an embodiment, the specific process of "inputting the feature information and random initial vectors into a pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector" in step S340 can be further described in conjunction with the following description.
  • the hidden layer representation is obtained by the formula u_ij = tanh(W_α·S_ij + b_α)
  • i represents the i-th text block
  • j represents the j-th time step
  • u_ij represents the hidden layer representation output by S_ij after passing through the single-layer perceptron network
  • S_ij represents S_i at the j-th time step.
  • the fully connected layer is used as the Reduce stage of the Hadoop architecture, referring to Sun Huadong's operation rules for the self-attention mechanism and global attention mechanism modules.
  • the local attention mechanism layer does not need to pay attention to the connection between a data block and other data blocks; each data block is independent, and the important feature information h_i of the current data block is extracted through the local attention mechanism layer.
  • the feature information h_i of all data blocks is weighted and summed to obtain the feature vector, and the internal information between the data blocks of the text is supplemented.
  • the activation function softmax is connected to output the final result.
  • the calculation formulas are shown in (4)-(6).
  • u_ij represents the hidden layer representation obtained by passing S_ij through the single-layer perceptron network; U_α represents the randomly initialized upper-lower-layer hidden state association vector, which is updated during training and shared by all time steps; α_ij represents the weight vector obtained by taking the inner product of u_ij and U_α and then performing softmax normalization; h_i represents the feature vector of the current data block obtained by the weighted summation of S_i.
  • the higher-level hidden layer representation u_ij is obtained through formula (4), then the weight vector α_ij is obtained through the softmax normalization of formula (5), representing the hidden state weight coefficient of the j-th time step in the current data block, and finally the weighted summation of formula (6) gives the feature vector representation h_i of the current data block.
  • the feature vector h_i obtained for each data block is used as an input vector and passed through the second fully connected layer, calculated in the same way as the feature vector of a single data block in (4)-(6) above. Finally, the activation function f is connected to obtain the output result, where f is softmax, namely:
  • in step S350, through the regularized maximum likelihood estimation of the CRF model, the best sequence is output as the entity recognition result.
  • the programming language is VC++
  • the software development platform is VisualStudio2015
  • the operating system is Win10 ⁇ 64
  • the CPU is Intel Core i7
  • the main frequency is 3.91GHz
  • the memory is 16GB.
  • the word embedding dimension is set to 350 dimensions
  • the hidden layer of the GRU network is set to 128 dimensions
  • Ten Map nodes are set in the Map stage, and m = 7 is set in the Reduce stage.
  • the corpus used comes from test reports provided by Shenzhen Customs, 8,300 in total.
  • the entity extraction problem of test reports differs slightly from the recognition of the seven traditional specific entity types such as person names, addresses, and companies; for key data extraction from test reports, a complex entity label set needs to be defined.
  • the entities that need to be defined comprise seven types: sample number, inspection date, sample item, testing instrument, testing method, sample volume, and report result.
  • the test reports are preprocessed, mainly including three parts: denoising, word segmentation, and word tagging. All tagging work is done manually by professionally trained personnel.
  • The tagging scheme used in the experiment is BIO, where B indicates the beginning of an entity, I indicates the parts of an entity other than the initial position, and O indicates a non-entity. The data are then randomly divided into training, development, and test sets at a ratio of 3:1:1.
  • the distribution of entity samples in the test reports is shown in Table 1.
  • This method uses precision (P), recall (R) and F1 value, three general evaluation indicators for named entity recognition, to evaluate the performance of the proposed method.
  • the three evaluation indicators are specifically defined as:
  • laboratory test reports were used as samples, and on the same data set, the P, R, and F1 values of different entities in 30 test reports were counted on the different models.
  • the BiGRU-Attention-CRF model is referred to as the BAC model.
  • the P, R, and F1 values are averages over the 30 test reports, and the comparison results of the model experiments are shown in Table 2.
  • compared with the BAC model, the Block-BAC model proposed by this method significantly improves precision, recall, and the F1 value, with the F1 value increasing by 2.55%; compared with the BERT-BAC model, the F1 value drops by 1.63%, which is slightly insufficient in entity recognition effect.
  • the precision and recall of the Block-BAC model are greatly improved on the basis of BAC.
  • although the improvement is not as large as that of the BERT-BAC model, a high F1 value is still guaranteed, and the entity recognition effect is good.
  • the duration of model training and entity recognition is also an important performance analysis indicator. Therefore, laboratory test reports were randomly selected; the model training module was tested by continuously increasing the number of training samples, the entity recognition module was tested by increasing the number of recognition samples, and the program running times are plotted in Fig. 6 and Fig. 7, respectively.
  • the Block-BAC model shortens the training time by 60.36%.
  • the Block-BAC model proposed by this method addresses the lack of efficiency in model training and entity recognition processing.
  • a CDC-based data chunking optimization algorithm is designed for preprocessing, and a parallel operation mechanism is realized based on the Hadoop architecture, which becomes an effective strategy for shortening the time of model training and entity recognition; the local attention mechanism is then optimized, which reduces the use of ineffective hidden-layer nodes compared with the global attention mechanism; finally, the fully connected layer is connected to obtain the feature vectors of the data blocks and to supplement the inner connections between the data blocks.
  • the model training time is shortened by 60.36%
  • the entity recognition time is shortened by 39.34%.
  • the F1 value is taken into account at the same time, which greatly reduces the degree of benefit trade-off, making the model more practical and better suited to users' needs.
  • the description is relatively simple, and for related parts, please refer to the part of the description of the method embodiment.
  • FIG. 8 shows a distributed system-based parallel named entity recognition device provided by an embodiment of the present application.
  • Blocking module 810 configured to obtain data text information, and perform text division on the data text information according to named entity recognition rules to generate several text blocks;
  • the hidden state acquisition module 820 is used to input the word embedding vectors obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block, and obtain the forward hidden state and reverse hidden state of the text block at time t state;
  • the feature information generation module 830 is used to take the forward hidden state and the reverse hidden state as the current time step, supplement the local features of the text block through the local attention mechanism, and output the feature information of the current time step;
  • the output feature vector generation module 840 is used to weight and sum the feature information of the text block through the fully connected layer, and generate the output feature vector through the activation function softmax;
  • the entity recognition result generating module 850 is configured to output the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
  • the block module 810 includes:
  • the sliding window generation submodule is used to determine the size of the sliding window, and slide from the beginning of the data text information with a preset unit step until the sliding window is loaded by data;
  • the current Hash value determination submodule is used to determine the current Hash value of the text block formed by the current position of the sliding window
  • the first text block generation submodule is used to divide the data text information with the right boundary of the sliding window as a dividing line when the current Hash value is greater than the first preset value, and divide the divided The part of the data text information located on the left side of the dividing line is set as the text block.
  • it also includes a second preset value, wherein the second preset value is smaller than the first preset value, and further includes:
  • the first sliding window driving sub-module is used to drive the sliding window to slide forward by one preset unit step when the current Hash value is smaller than the second preset value.
  • a historical Hash value acquisition submodule used to obtain the most adjacent to the current Hash value when the value of the current Hash value is between the first preset value and the second preset value The first historical Hash value and the next adjacent second historical Hash value;
  • a judging submodule configured to judge whether the current Hash value is equal to the first historical Hash value and the second historical Hash value
  • the second text block generation sub-module is used to, if so, use the right boundary of the sliding window as a dividing line to divide the data text information, and place the data text information part on the left side of the dividing line after division Set as the text block; the second sliding window driving submodule is used to drive the sliding window to slide forward by a preset unit step if not.
  • a split position judging submodule used to judge whether the sliding window has reached the end of the data text information
  • the third sliding window driving sub-module is used to drive the sliding window to slide forward by a preset unit step if not.
  • the hidden state acquisition module 820 includes:
  • the neural network building sub-module is used to build two independent GRU neural networks
  • the hidden state generation submodule is used to summarize and merge the forward information and backward information of the text block; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
  • the characteristic information generation module 830 includes:
  • the context vector determination sub-module is used to determine the forward context vector and the reverse context vector using the local attention mechanism with a fixed window size, taking the forward hidden state and the reverse hidden state as the current time step;
  • a feature information generating submodule configured to use the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector as training conditions to generate the feature information; wherein, the Feature information includes forward feature information and reverse feature information.
  • the output feature vector generation module 840 includes;
  • the common learning sub-module is used to input feature information and random initial vectors to the pre-built single-layer perceptron network for joint learning to generate hidden layer representations and hidden state correlation vectors of upper and lower layers;
  • a weight vector determining submodule configured to determine a weight vector according to the hidden layer representation and the upper and lower layer hidden state association vectors
  • a weighted summation submodule configured to perform weighted summation on the feature information according to the weight vector to obtain a summation result
  • the output feature vector generation sub-module is used to generate the output feature vector according to the summation result through the activation function softmax.
  • the joint learning submodule includes:
  • the hidden layer representation is obtained by the formula u_ij = tanh(W_α·S_ij + b_α)
  • i represents the i-th text block
  • j represents the j-th time step
  • u_ij represents the hidden layer representation output by S_ij after passing through the single-layer perceptron network
  • S_ij represents S_i at the j-th time step.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

This application provides a parallelized named entity recognition method and device based on a distributed system. The method includes: acquiring data text information, and dividing the data text information into several text blocks according to named entity recognition rules; inputting the word embedding vector obtained by each text block through the embedding layer into a BiGRU neural network to mine the global features of the text block, and obtaining the forward hidden state and reverse hidden state of the text block at time t; taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through a local attention mechanism, and outputting the feature information of the current time step; performing a weighted summation of the feature information of the text block through a fully connected layer, and generating an output feature vector through the softmax activation function; and outputting the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model. This effectively reduces the number of hidden-layer nodes of the model.

Description

Parallelized named entity recognition method and device based on a distributed system
Technical Field
This application relates to the field of identity authentication, and in particular to a parallelized named entity recognition method and device based on a distributed system.
Background Art
The named entity recognition (NER) problem is a piece of basic research in the field of natural language processing and is closely related to applications such as entity information extraction, relation extraction, syntactic analysis, and knowledge graph construction. The practicability of a named entity recognition model is the criterion for judging its quality; its main function is to automatically identify the boundaries of entities in text sentences, and it plays an important role in fields such as corporate annual reports, test reports, judicial documents, online shopping reviews, and medical guidelines. The evaluation indicators focus on the two aspects of accuracy and efficiency. However, these two aspects often show a benefit trade-off: when accuracy is improved, efficiency is reduced, and to a certain extent one rises while the other falls, making it difficult to improve the overall performance of the model. Therefore, research on reducing the trade-off between accuracy and efficiency and enhancing practicability is of great importance.
In the early days, named entity recognition technology used rule- and dictionary-based methods, but complex rules and dictionaries had to be built manually, involving a large workload for a small scope of application. With the development of natural language processing technology, shallow learning methods based on statistical models became widely used, such as the hidden Markov model (HMM), the conditional random field (CRF), and the maximum entropy model (ME). These solved the problems of manually building dictionaries and of the small scope of application, but they rely heavily on the corpus and on the setting of a large number of hyperparameters, requiring designers with rich experience and skill. Driven by deep learning, neural networks such as CNN, LSTM, and GRU have been applied to named entity recognition on a large scale. For example, Huang et al. first applied BiLSTM combined with CRF to named entity recognition; Ma et al. learned from word-level and character-level representations by using a combination of bidirectional LSTM, CNN, and CRF, achieving end-to-end recognition in the true sense; Ma Xiaofei added an attention mechanism layer between the LSTM layer and the CRF layer, so that the model can better mine the local features of the sequence, realizing named entity recognition for Chinese social media; Cao et al. learned the dependency between any two characters in a sentence through a local attention mechanism to perform named entity recognition on Weibo text; and Li Mingyang et al. combined a self-attention mechanism to build an LSTM-SelfAtt-CRF model to perform named entity recognition on social media. Compared with shallow learning, deep learning methods highlight the importance of feature learning, use big data for feature learning to obtain the intrinsic information of the data, and achieve better named entity recognition performance.
In practical use, the accuracy of named entity recognition is the primary requirement, so improving the model's F1 value has become the focus of research. In this research direction, the BERT-BiGRU-Attention-CRF (BERT-BAC for short) model constructed by Zhang Jingyi et al. is representative, but it also raises the benefit trade-off problem, making model training and named entity recognition too time-consuming.
Summary of the Invention
In view of the above problems, this application is proposed in order to provide a parallelized named entity recognition method and device based on a distributed system that overcomes the above problems or at least partially solves them, including:
A parallelized named entity recognition method based on a distributed system, the method comprising:
acquiring data text information, and dividing the data text information into several text blocks according to named entity recognition rules;
inputting the word embedding vector obtained by each text block through the embedding layer into a BiGRU neural network to mine the global features of the text block, and obtaining the forward hidden state and reverse hidden state of the text block at time t;
taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through a local attention mechanism, and outputting the feature information of the current time step;
performing a weighted summation of the feature information of the text block through a fully connected layer, and generating an output feature vector through the softmax activation function;
outputting the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
Further, the step of acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules includes:
determining the size of a sliding window, and sliding it from the beginning of the data text information in preset unit steps until the sliding window is fully loaded with data;
determining the current Hash value of the text block formed at the current position of the sliding window;
when the current Hash value is greater than a first preset value, using the right boundary of the sliding window as a dividing line to divide the data text information, and setting the part of the data text information located to the left of the dividing line after division as the text block.
Further, a second preset value is also included, wherein the second preset value is smaller than the first preset value, and the method further includes the step of:
when the current Hash value is smaller than the second preset value, driving the sliding window to slide forward by one preset unit step.
Further, the method also includes the steps of:
when the current Hash value lies between the first preset value and the second preset value, obtaining the first historical Hash value most adjacent to the current Hash value and the next-adjacent second historical Hash value;
judging whether the current Hash value is equal to the first historical Hash value and the second historical Hash value;
if so, using the right boundary of the sliding window as a dividing line to divide the data text information, and setting the part of the data text information located to the left of the dividing line after division as the text block; if not, driving the sliding window to slide forward by one preset unit step.
Further, the method also includes the steps of:
judging whether the sliding window has reached the end of the data text information;
if not, driving the sliding window to slide forward by one preset unit step.
Further, the step of inputting the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t includes:
constructing two independent GRU neural networks;
summarizing and merging the forward information and backward information of the text block; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
Further, the step of taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step includes:
taking the forward hidden state and reverse hidden state as the current time step, and determining the forward context vector and the reverse context vector using a local attention mechanism with a fixed window size;
generating the feature information using the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector as training conditions, wherein the feature information includes forward feature information and reverse feature information.
Further, the step of performing a weighted summation of the feature information of the text block through the fully connected layer and generating an output feature vector through the softmax activation function includes:
inputting the feature information and a random initial vector into a pre-built single-layer perceptron network for joint learning to generate a hidden layer representation and an upper-lower-layer hidden state association vector;
determining a weight vector according to the hidden layer representation and the upper-lower-layer hidden state association vector;
performing a weighted summation of the feature information according to the weight vector to obtain a summation result;
generating the output feature vector from the summation result through the softmax activation function.
Further, the step of inputting the feature information and a random initial vector into the pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector includes:
obtaining the hidden layer representation by the following formula:
let the forward feature information output through local attention in the i-th text block be hf_ij and the reverse feature information be hb_ij, and let S_ij = [hf_ij; hb_ij] denote their combination; then
u_ij = tanh(W_α·S_ij + b_α)
where i represents the i-th text block, j represents the j-th time step, u_ij represents the hidden layer representation output after S_ij passes through the single-layer perceptron network, and S_ij represents S_i at the j-th time step.
A parallelized named entity recognition device based on a distributed system, specifically comprising:
a blocking module, configured to acquire data text information and divide the data text information into several text blocks according to named entity recognition rules;
a reverse hidden state acquisition module, configured to input the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t;
a feature information generation module, configured to take the forward hidden state and reverse hidden state as the current time step, supplement the local features of the text block through the local attention mechanism, and output the feature information of the current time step;
an output feature vector generation module, configured to perform a weighted summation of the feature information of the text block through the fully connected layer and generate an output feature vector through the softmax activation function;
an entity recognition result generation module, configured to output the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
This application has the following advantages:
In the embodiments of this application, data text information is acquired and divided into several text blocks according to named entity recognition rules; the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and reverse hidden state of the text block at time t are obtained; taking the forward and reverse hidden states as the current time step, the local features of the text block are supplemented through the local attention mechanism and the feature information of the current time step is output; the feature information of the text block is weighted and summed through the fully connected layer, and an output feature vector is generated through the softmax activation function; and through the regularized maximum likelihood estimation of the CRF model, the best sequence is output as the entity recognition result. This effectively reduces the hidden-layer nodes of the model. Compared with the existing BERT-BAC model, while ensuring a relatively high F1 value (the harmonic mean of precision and recall), the model training time and entity recognition time are shortened by 60.36% and 39.43%, respectively, giving the method broader practicability.
Brief Description of the Drawings
In order to explain the technical solution of this application more clearly, the drawings required in the description of this application are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of the Block-BAC model of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 2 is a MapReduce schematic diagram of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 3 is a flow chart of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 4 is a flow chart of the CDC-based data chunking optimization algorithm of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 5 is a schematic diagram of the local attention mechanism of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 6 is a schematic diagram comparing model training running times of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 7 is a schematic diagram comparing entity recognition running times of a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application;
Fig. 8 is a structural block diagram of a parallelized named entity recognition device based on a distributed system provided by an embodiment of this application.
Detailed Description of the Embodiments
In order to make the above objectives, features, and advantages of this application more obvious and understandable, this application is further described in detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of this application.
It should be noted that the method is based on the Hadoop architecture and includes five parts: text block preprocessing, a BiGRU neural network, a local attention mechanism, a fully connected layer, and CRF, as shown in Fig. 1. Among them, BiGRU and CRF directly use the method used by the BERT-BAC model.
On the Hadoop architecture, combined with the characteristics of text in named entity recognition, the text is divided into blocks and preprocessed according to the parallelization requirements. Then, in the Map parallel processing stage, the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network, the global features of the text block are mined, and the forward and reverse hidden states of the text block at time t are obtained. Next, taking these hidden states as the current time step, the local attention mechanism is used to supplement the local features of the text block, and the feature information h_t of the current time step is output. Finally, in the Reduce stage, the fully connected layer is set to perform a weighted summation of the feature information h_t of all text blocks, and softmax is connected to obtain the output feature vector. Through the regularized maximum likelihood estimation of the CRF model, the best sequence is output and the entity recognition result is obtained.
Parallel processing based on Hadoop is not limited to the text blocks of one text on a certain device; in the same way, it can also realize parallel processing of multiple texts from different devices and fuse, from multiple sources, the entities recognized by different devices. In practical use, it can reduce the human resources needed to operate equipment, shorten data processing time, and improve work efficiency.
The two core functions of Hadoop are distributed storage and parallel data processing, implemented by HDFS and MapReduce. HDFS is a distributed file system that provides an efficient way to manage data with high fault tolerance across clusters; MapReduce is used for the parallel processing of large-scale data sets.
This method uses MapReduce to achieve high-performance parallel computing and HDFS to complete the underlying data storage. MapReduce works mainly in two stages, Map and Reduce. First, the text is divided into several small data blocks, and the blocks are sent to specific nodes for Map-stage processing; the processing directly calls the parallel processing section of the model. On each node, the Maps are processed at the same time; this is where Hadoop's parallelization is reflected, and each Map consists of the embedding layer, the BiGRU layer, and the local attention optimization mechanism layer connected in series. The Reduce stage then sets the fully connected layer channel to merge the results output by the Maps, and finally the CRF layer is connected to output the recognition results. An example of MapReduce is shown in Fig. 2.
From the process described in Fig. 2, it can be seen that a data block is processed by Map, entity m of data block n is recognized, and the feature information h_t(n, m) is obtained; the outputs are then partitioned according to the recognized entity m output by the Map, with partitioning controlled by a user-defined partition function. Each Reduce task corresponds to one partition, and multiple Reduce stages are specified independently; the entity feature information of the partition corresponding to each Reduce is weighted, summed, and output. For the commonly used recognition of a single entity, only one Reduce task is needed to directly merge and process the data output by the Map.
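As a non-limiting illustration of the Map / partition / Reduce flow described above, the following single-process C++ sketch stubs the per-block model call and routes Map outputs to Reduce tasks through a user-defined partition function before a weighted summation per partition. The type and function names (FeatureVec, mapBlock, partitionFn, reducePartition) and the equal weighting are assumptions made for illustration, not part of the patent; a real deployment would run the Map and Reduce tasks on Hadoop nodes.

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using FeatureVec = std::vector<double>;                  // feature information h_t(n, m)
struct MapOutput { std::string entity; FeatureVec h; };  // (recognized entity m, its features)

// Map stage: each text block n is processed independently; the real model runs
// embedding -> BiGRU -> local attention here, which this sketch only stubs.
std::vector<MapOutput> mapBlock(const std::string& block) {
    (void)block;
    return { {"sample_number", FeatureVec(4, 0.1)}, {"report_result", FeatureVec(4, 0.3)} };
}

// User-defined partition function: routes one Map output to one Reduce task.
std::size_t partitionFn(const std::string& entity, std::size_t numReducers) {
    return std::hash<std::string>{}(entity) % numReducers;
}

// Reduce stage: weighted summation of the feature vectors that landed in one partition.
FeatureVec reducePartition(const std::vector<MapOutput>& part) {
    if (part.empty()) return {};
    FeatureVec sum(part[0].h.size(), 0.0);
    const double w = 1.0 / part.size();                  // equal weights for the sketch
    for (const auto& o : part)
        for (std::size_t k = 0; k < sum.size(); ++k) sum[k] += w * o.h[k];
    return sum;
}

int main() {
    const std::vector<std::string> blocks = {"text block 1 ...", "text block 2 ..."};
    const std::size_t numReducers = 7;                   // m = 7 in the experiments below
    std::map<std::size_t, std::vector<MapOutput>> partitions;
    for (const auto& b : blocks)                         // Map (run in parallel on Hadoop)
        for (auto& o : mapBlock(b))
            partitions[partitionFn(o.entity, numReducers)].push_back(o);
    for (const auto& [id, part] : partitions)            // Reduce (one task per partition)
        std::cout << "partition " << id << ": " << reducePartition(part).size()
                  << "-dimensional summed feature\n";
}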
When a file is too large, the device may run out of memory when storing the data blocks in the HDFS underlying data storage. At this time the system reports an exception and pops up the dialog box "The computer is low on memory; please save your files and close these programs"; the system has to be restarted manually and the file divided into several parts and re-input. In addition, because device configuration parameters differ, the number of Map nodes should be set so as to guarantee the operation of the system and prevent the situation "an exception occurred in the application".
It should be noted that, on the basis of BiGRU-CRF, the BERT-BAC model uses the BERT pre-trained language model to increase the information expression of the word vectors, then uses the global attention mechanism to fully mine the internal features of the text, and adopts the strategy of increasing hidden-layer nodes to improve the F1 value, which raised the F1 value by 8%.
The advantage of this method is that the bidirectional BERT pre-trained language model is used, and its information expression is better than that of feature-based and fine-tuning-based methods; representing words as vectors allows the word vectors to fit the context. Secondly, the global attention mechanism is used to learn the information between words in a sentence and to selectively focus on important information, which can meet the need to recognize entities accurately. But it also causes a benefit trade-off:
(1) In the BERT pre-trained language model, the learning of the recursive neural network and the self-attention mechanism overlaps with that of the global attention mechanism and BiGRU, which increases the computational complexity and makes model training time long.
(2) The network structure of the model is complex and hidden-layer nodes are added; the learning ability is improved, but the response speed is slow and entity recognition takes a long time.
Referring to Fig. 3, a parallelized named entity recognition method based on a distributed system provided by an embodiment of this application is shown. The method includes:
S310: acquiring data text information, and dividing the data text information into several text blocks according to named entity recognition rules;
S320: inputting the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block, and obtaining the forward hidden state and reverse hidden state of the text block at time t;
S330: taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step;
S340: performing a weighted summation of the feature information of the text block through the fully connected layer, and generating an output feature vector through the softmax activation function;
S350: outputting the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
In the embodiments of this application, data text information is acquired and divided into several text blocks according to named entity recognition rules; the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and reverse hidden state of the text block at time t are obtained; taking the forward and reverse hidden states as the current time step, the local features of the text block are supplemented through the local attention mechanism and the feature information of the current time step is output; the feature information of the text block is weighted and summed through the fully connected layer, and an output feature vector is generated through the softmax activation function; and through the regularized maximum likelihood estimation of the CRF model, the best sequence is output as the entity recognition result. This effectively reduces the hidden-layer nodes of the model. Compared with the existing BERT-BAC model, while ensuring a relatively high F1 value (the harmonic mean of precision and recall), the model training time and entity recognition time are shortened by 60.36% and 39.43%, respectively, giving the method broader practicability.
Below, the parallelized named entity recognition method based on a distributed system in this exemplary embodiment is further described.
As described in step S310 above, data text information is acquired, and the data text information is divided into several text blocks according to named entity recognition rules.
In an embodiment, the specific process of "acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules" in step S310 can be further described in conjunction with the following description.
As described in the following step, the size of the sliding window is determined, and it is slid from the beginning of the data text information in preset unit steps until the sliding window is fully loaded with data.
As described in the following step, the current Hash value of the text block formed at the current position of the sliding window is determined.
As described in the following step, when the current Hash value is greater than the first preset value, the right boundary of the sliding window is used as a dividing line to divide the data text information, and the part of the data text information located to the left of the dividing line after division is set as the text block.
It should be noted that the value of the first preset value depends on the Hash algorithm; setting it according to the characteristics of the text gives a better effect, and 800B is preferred.
In an embodiment, the specific process of "acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules" in step S310 can be further described in conjunction with the following description.
As described in the following step, when the current Hash value is smaller than the second preset value, the sliding window is driven to slide forward by one preset unit step.
It should be noted that the value of the second preset value depends on the Hash algorithm; setting it according to the characteristics of the text gives a better effect, and 500B is preferred.
In an embodiment, the specific process of "acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules" in step S310 can be further described in conjunction with the following description.
As described in the following step, when the current Hash value lies between the first preset value and the second preset value, the first historical Hash value most adjacent to the current Hash value and the next-adjacent second historical Hash value are obtained.
As described in the following step, it is judged whether the current Hash value is equal to the first historical Hash value and the second historical Hash value.
As described in the following step, if so, the right boundary of the sliding window is used as a dividing line to divide the data text information, and the part of the data text information located to the left of the dividing line after division is set as the text block; if not, the sliding window is driven to slide forward by one preset unit step.
In an embodiment, the specific process of "acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules" in step S310 can be further described in conjunction with the following description.
As described in the following step, it is judged whether the sliding window has reached the end of the data text information.
As described in the following step, if not, the sliding window is driven to slide forward by one preset unit step.
As an example, the content-defined variable-length chunking (CDC) algorithm is a strategy that applies Rabin fingerprints to divide text into chunks of different lengths. A fixed-size sliding window is used to divide the text: when the Rabin fingerprint value of the sliding window matches the expected value, a split point is placed at this position, and this process is repeated until the entire text has been divided, so that the text is divided into text blocks at the preset split points.
For entity recognition, the parallelization effect of Hadoop is mainly reflected in the chunking of the text data, but entity recognition depends strongly on the textual context, so the quality of the chunking algorithm directly affects the entity recognition effect. Based on this, and considering the case where multiple text blocks are processed at the same time, this method constrains the size of the text blocks during chunking so as to avoid overly large text blocks; at the same time, according to the characteristics of Chinese sentences, chunking at the end of a paragraph is given priority. The flow of the CDC-based text chunking algorithm is shown in Fig. 4; specifically:
1) Read the file.
2) Set a sliding window of a specific size w and slide it from the beginning of the text in units of two bytes until the window of size w is fully loaded with data.
3) Compute, with a simple additive hash, the Hash value of the text block formed at the current sliding-window position.
4) Compare the computed Hash value with the set Hash value X, where X is taken as 500B. If the Hash value ≤ X, move forward one unit according to step 2; if the Hash value > X, compare according to step 5.
5) Compare the computed Hash value with the set Hash value Y, where Y is taken as 800B. If the Hash value ≥ Y, split at this position according to step 7; if the Hash value < Y, compare according to step 6.
6) Compare the computed Hash value with the Hash values obtained in the previous two computations. If the Hash value is equal to the Hash values obtained in the previous two computations, split at this position according to step 7; if not, move forward one unit according to step 2.
7) Use the right boundary of the sliding window as a dividing line and split the text.
8) Judge whether the sliding window has reached the end of the file, that is, whether the text chunking is finished. If not, slide the window forward one unit according to step 2; if so, the text chunking is complete.
It should be noted that X above corresponds to the second preset value and Y corresponds to the first preset value.
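A minimal C++ sketch of steps 1)-8) above follows, using a simple additive hash over a fixed window that slides two bytes at a time and the two thresholds X = 500 and Y = 800. The window size w, the reduction of the additive hash onto a small range so that the "B" thresholds are meaningful, and the synthetic input text are assumptions made for illustration only.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

std::vector<std::string> cdcChunk(const std::string& text,
                                  std::size_t w = 48,       // sliding-window size (assumed)
                                  std::uint64_t X = 500,    // threshold X (second preset value)
                                  std::uint64_t Y = 800) {  // threshold Y (first preset value)
    std::vector<std::string> blocks;
    std::size_t blockStart = 0;
    std::uint64_t prev1 = 0, prev2 = 0;                     // hashes of the previous two windows
    for (std::size_t pos = 0; pos + w <= text.size(); pos += 2) {   // steps 2/8: slide two bytes
        std::uint64_t h = 0;
        for (std::size_t k = 0; k < w; ++k)                 // step 3: simple additive hash
            h += static_cast<unsigned char>(text[pos + k]);
        h %= 1024;                                          // map onto 0..1023 so X and Y apply (assumption)
        bool cut = false;
        if (h > X) {                                        // step 4: above X, go to step 5
            if (h >= Y) cut = true;                         // step 5: at or above Y, split
            else if (h == prev1 && h == prev2) cut = true;  // step 6: equal to the last two hashes
        }
        prev2 = prev1;
        prev1 = h;
        if (cut && pos + w > blockStart) {                  // step 7: split at the right boundary
            blocks.push_back(text.substr(blockStart, pos + w - blockStart));
            blockStart = pos + w;
        }
    }
    if (blockStart < text.size())                           // the remaining tail forms the last block
        blocks.push_back(text.substr(blockStart));
    return blocks;
}

int main() {
    std::string doc;                                        // stand-in for a loaded test report (step 1)
    for (int i = 0; i < 5000; ++i) doc.push_back(static_cast<char>('a' + (i * 31) % 26));
    std::cout << cdcChunk(doc).size() << " text blocks\n";
}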
Through text chunking preprocessing, n text blocks are obtained, and the text blocks are sorted by size in bytes from largest to smallest. When n < the number of Map nodes, the text blocks are sent in sorted order to specific nodes for Map processing, and the Map stage is entered taking the end of the Map parallel processing of text block 1, the largest in bytes, as the basis. When n > the number of Map nodes, n is divided by the number of Map nodes to obtain the number of groups; the text blocks are sent group by group, in sorted order, to specific nodes for Map processing, and the Map stage is entered taking the end of the Map parallel processing of text block c as the basis, where c = number of groups × number of nodes + 1. This solves the problem that the last text block being too small makes the parallel processing inconsistent and causes data disorder.
This algorithm divides the text into text blocks, minimizing the probability that semantic interruption caused by text block division leads to inaccurate output results and increasing the entity recognition effect of the model. At the same time, chunking can reduce the computation of the forgotten amount of past information in neural network learning between two adjacent paragraphs and improve the entity recognition efficiency of the model.
As described in step S320 above, the word embedding vector obtained by each text block through the embedding layer is input into the BiGRU neural network to mine the global features of the text block, and the forward hidden state and reverse hidden state of the text block at time t are obtained.
In an embodiment, the specific process of "inputting the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t" in step S320 can be further described in conjunction with the following description.
As described in the following step, two independent GRU neural networks are constructed.
As described in the following step, the forward information and backward information of the text block are summarized and merged; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
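To make the two independent GRU networks concrete, the following C++ sketch implements one standard GRU step and a bidirectional pass over a block's word embeddings: the forward GRU produces the forward hidden state, and the backward GRU, reading the sequence in reverse, produces the reverse hidden state at every time step t. The weight layout and gate equations follow the conventional GRU formulation, which the patent does not spell out, and the dimensions in main are illustrative.

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static double sigmoidf(double v) { return 1.0 / (1.0 + std::exp(-v)); }

static Vec matvec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0);
    for (std::size_t i = 0; i < W.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
    return y;
}

struct GRU {
    Mat Wz, Uz, Wr, Ur, Wh, Uh;                        // update / reset / candidate weights
    // One step: next hidden state from the embedded input word x and previous state hPrev.
    Vec step(const Vec& x, const Vec& hPrev) const {
        Vec z = matvec(Wz, x), uz = matvec(Uz, hPrev);
        Vec r = matvec(Wr, x), ur = matvec(Ur, hPrev);
        Vec zg(hPrev.size()), rh(hPrev.size());
        for (std::size_t i = 0; i < hPrev.size(); ++i) {
            zg[i] = sigmoidf(z[i] + uz[i]);            // update gate
            rh[i] = sigmoidf(r[i] + ur[i]) * hPrev[i]; // reset gate applied to the previous state
        }
        Vec n = matvec(Wh, x), un = matvec(Uh, rh);
        Vec h(hPrev.size());
        for (std::size_t i = 0; i < hPrev.size(); ++i) {
            double cand = std::tanh(n[i] + un[i]);     // candidate state
            h[i] = (1.0 - zg[i]) * hPrev[i] + zg[i] * cand;
        }
        return h;
    }
};

// Bidirectional pass over one text block: forward and reverse hidden states per time step.
std::pair<std::vector<Vec>, std::vector<Vec>>
biGRU(const GRU& fwd, const GRU& bwd, const std::vector<Vec>& x, std::size_t hiddenDim) {
    std::vector<Vec> hf(x.size()), hb(x.size());
    Vec h(hiddenDim, 0.0);
    for (std::size_t t = 0; t < x.size(); ++t) { h = fwd.step(x[t], h); hf[t] = h; }  // forward GRU
    h.assign(hiddenDim, 0.0);
    for (std::size_t t = x.size(); t-- > 0; )  { h = bwd.step(x[t], h); hb[t] = h; }  // backward GRU
    return {hf, hb};
}

int main() {
    const std::size_t d = 4, hdim = 3;
    GRU fwd{Mat(hdim, Vec(d)), Mat(hdim, Vec(hdim)), Mat(hdim, Vec(d)),
            Mat(hdim, Vec(hdim)), Mat(hdim, Vec(d)), Mat(hdim, Vec(hdim))};
    GRU bwd = fwd;
    std::vector<Vec> embeddings(5, Vec(d, 0.1));       // five embedded words of one block
    auto [hf, hb] = biGRU(fwd, bwd, embeddings, hdim); // forward / reverse hidden states per step
    return (hf.size() == hb.size()) ? 0 : 1;
}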
As described in step S330 above, the forward hidden state and the reverse hidden state are taken as the current time step, the local features of the text block are supplemented through the local attention mechanism, and the feature information of the current time step is output.
In an embodiment, the specific process of "taking the forward hidden state and reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step" in step S330 can be further described in conjunction with the following description.
As described in the following step, the forward hidden state and reverse hidden state are taken as the current time step, and the forward context vector and the reverse context vector are determined using a local attention mechanism with a fixed window size.
As described in the following step, the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector are used as training conditions to generate the feature information, wherein the feature information includes forward feature information and reverse feature information.
As an example, global attention computes over all time steps in the sequence: because the key time steps are difficult to determine, all time steps have to be computed so that key time steps are not ignored, but the computational cost this brings is very high. Therefore, although the BiGRU neural network can fully obtain features globally, it falls short when obtaining local features. In each data block, taking the hidden state h_t output by the two unidirectional, opposite GRUs as the time step, the local attention mechanism is used to fully obtain the local features around this time step, obtain the internal structure information of the sentence, and reduce the computation of global attention, making the output result both accurate and fast. The idea of the attention mechanism is that when people observe an image, they mostly focus on a specific part according to their own needs; therefore the scope of the traditional local attention mechanism is optimized, as shown in Fig. 5.
In text, the adjacent upper and lower text is as strongly related as the text to the left and right. Taking the GRU neural network output h_t (h_t^r in Fig. 5) as the time step, a local attention mechanism with a fixed window size is used: the length of the window is 2D+1 and the width is 2i+1, where D and i are hyperparameters, i.e., the one-directional distance from the center h_t to the window boundary. The context vector C_t is calculated as follows:
C_t = Σ_{i=t-D}^{t+D} α_t,i·h_i        (1)
where the vector α_t is the calculated correction vector and h_i is the hidden vector.
From the formula it can be seen that, apart from the difference in the range of time steps considered, everything else is exactly the same as in the global attention mechanism. h_t is the center of the window; it is made directly equal to the current time step t and is obtained through training, namely:
h_t = T_x·σ(v_p·tanh(W_p·h_i))        (2)
where σ is the sigmoid function, and v_p and W_p are trainable parameters. The h_t calculated in this way is therefore a floating-point number, but this has no effect, because when the calibration weight vector α_t is calculated, an additional term with mean h_t is introduced.
The finally computed output is:
(3) (the expression is given as an image in the original document)
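The following C++ sketch illustrates the fixed-window local attention step above: for the current time step t only the hidden vectors in [t-D, t+D] are scored, the scores are softmax-normalised, and each position is additionally down-weighted by a Gaussian term centred on the window centre predicted by formula (2). Since formula (3) is reproduced only as an image in the original, the dot-product score and the Gaussian width D/2 used here are assumptions borrowed from the standard local attention formulation rather than the patent's exact expression.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// h: hidden vectors of the block (one per time step), t: current time step,
// D: one-directional window distance, centre: predicted window centre from formula (2).
Vec localContext(const std::vector<Vec>& h, std::size_t t, std::size_t D, double centre) {
    const std::size_t lo = (t < D) ? 0 : t - D;
    const std::size_t hi = std::min(h.size() - 1, t + D);
    const double sigma = D / 2.0;                        // Gaussian width (assumed)
    std::vector<double> w(hi - lo + 1);
    double norm = 0.0;
    for (std::size_t i = lo; i <= hi; ++i) {
        const double score = std::exp(dot(h[t], h[i]));                        // softmax numerator
        const double gauss = std::exp(-(i - centre) * (i - centre) / (2.0 * sigma * sigma));
        w[i - lo] = score * gauss;
        norm += score;                                                         // softmax denominator
    }
    Vec c(h[t].size(), 0.0);                             // context vector C_t
    for (std::size_t i = lo; i <= hi; ++i)
        for (std::size_t k = 0; k < c.size(); ++k)
            c[k] += (w[i - lo] / norm) * h[i][k];
    return c;
}

int main() {
    std::vector<Vec> h(30, Vec(8, 0.05));                // hidden vectors of one text block
    Vec Ct = localContext(h, 12, 15, 12.0);              // D = 15, as in the experiments below
    return Ct.empty() ? 1 : 0;
}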
As described in step S340 above, the feature information of the text block is weighted and summed through the fully connected layer, and the output feature vector is generated through the softmax activation function.
In an embodiment, the specific process of "weighting and summing the feature information of the text block through the fully connected layer and generating the output feature vector through the softmax activation function" in step S340 can be further described in conjunction with the following description.
As described in the following step, the feature information and a random initial vector are input into the pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector.
As described in the following step, the weight vector is determined according to the hidden layer representation and the upper-lower-layer hidden state association vector.
As described in the following step, the feature information is weighted and summed according to the weight vector to obtain a summation result.
As described in the following step, the output feature vector is generated from the summation result through the softmax activation function.
In an embodiment, the specific process of "inputting the feature information and a random initial vector into the pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector" in step S340 can be further described in conjunction with the following description.
The hidden layer representation is obtained by the following formula:
let the forward feature information output through local attention in the i-th text block be hf_ij and the reverse feature information be hb_ij, and let S_ij = [hf_ij; hb_ij] denote their combination; then
u_ij = tanh(W_α·S_ij + b_α)
where i represents the i-th text block, j represents the j-th time step, u_ij represents the hidden layer representation output after S_ij passes through the single-layer perceptron network, and S_ij represents S_i at the j-th time step.
As an example, the fully connected layer serves as the Reduce stage of the Hadoop architecture, referring to Sun Huadong's operation rules for the self-attention mechanism and global attention mechanism modules. First, the local attention mechanism layer does not need to pay attention to the connection between one data block and other data blocks; the data blocks are independent of each other, and the important feature information h_i of the current data block is extracted through the local attention mechanism layer. Then the feature information h_i of all data blocks is weighted and summed to obtain the feature vector, supplementing the internal information between the data blocks of the text. Finally, the softmax activation function is connected to output the final result. The calculation formulas are shown in (4)-(6):
A weighted summation is performed on the forward feature information and reverse feature information output by local attention in the i-th data block; let S_ij denote their combination, then
u_ij = tanh(W_α·S_ij + b_α)        (4)
α_ij = exp(u_ij·U_α) / Σ_j exp(u_ij·U_α)        (5)
h_i = Σ_j α_ij·S_ij        (6)
where, for the j-th time step of the i-th data block, u_ij represents the hidden layer representation obtained by passing S_ij through the single-layer perceptron network; U_α represents the randomly initialized upper-lower-layer hidden state association vector, which is updated during model training and is shared by all time steps; α_ij represents the weight vector obtained by taking the inner product of u_ij and U_α and then performing softmax normalization; and h_i represents the feature vector of the current data block obtained by the weighted summation of S_i.
First, the higher-level hidden layer representation u_ij is obtained through formula (4); then the weight vector α_ij is obtained through the softmax normalization of formula (5), representing the hidden state weight coefficient of the j-th time step in the current data block; finally, the weighted summation of formula (6) gives the feature vector representation h_i of the current data block.
The feature vector h_i obtained for each data block is used as an input vector and passed through a second fully connected layer, calculated in the same way as the feature vector of a single data block in (4)-(6) above. Finally, the activation function f is connected to obtain the output result, where f is softmax, namely:
F(x) = softmax(W_β·y + b_β)        (7)
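The following C++ sketch mirrors formulas (4)-(7): each per-time-step state S_ij of a data block is projected by a single-layer perceptron, scored against the shared association vector U_α, softmax-normalised into α_ij, and weight-summed into the block feature h_i; a final affine layer followed by softmax gives F(x). The matrix and vector dimensions in main are illustrative only.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static Vec affine(const Mat& W, const Vec& x, const Vec& b) {
    Vec y(b);
    for (std::size_t i = 0; i < W.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += W[i][j] * x[j];
    return y;
}

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

static Vec softmaxVec(Vec v) {
    const double m = *std::max_element(v.begin(), v.end());
    double z = 0.0;
    for (double& x : v) { x = std::exp(x - m); z += x; }
    for (double& x : v) x /= z;
    return v;
}

// Formulas (4)-(6): feature vector h_i of one data block from its states S_i1..S_iT.
Vec blockFeature(const std::vector<Vec>& S, const Mat& Wa, const Vec& ba, const Vec& Ualpha) {
    Vec scores;
    for (const Vec& Sij : S) {
        Vec uij = affine(Wa, Sij, ba);
        for (double& x : uij) x = std::tanh(x);          // (4) u_ij = tanh(W_a * S_ij + b_a)
        scores.push_back(dot(uij, Ualpha));              // inner product with U_alpha
    }
    const Vec alpha = softmaxVec(scores);                // (5) weight vector alpha_ij
    Vec h(S.front().size(), 0.0);
    for (std::size_t j = 0; j < S.size(); ++j)           // (6) h_i = sum_j alpha_ij * S_ij
        for (std::size_t k = 0; k < h.size(); ++k) h[k] += alpha[j] * S[j][k];
    return h;
}

// Formula (7): F(x) = softmax(W_beta * y + b_beta) on the second fully connected layer.
Vec outputLayer(const Vec& y, const Mat& Wb, const Vec& bb) {
    return softmaxVec(affine(Wb, y, bb));
}

int main() {
    const std::size_t dim = 6, labels = 4, steps = 10;
    std::vector<Vec> S(steps, Vec(dim, 0.2));            // per-time-step states of one data block
    Vec h = blockFeature(S, Mat(dim, Vec(dim, 0.01)), Vec(dim, 0.0), Vec(dim, 0.1));
    Vec F = outputLayer(h, Mat(labels, Vec(dim, 0.01)), Vec(labels, 0.0));
    return F.size() == labels ? 0 : 1;
}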
As described in step S350 above, through the regularized maximum likelihood estimation of the CRF model, the best sequence is output as the entity recognition result.
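The patent only states that the CRF layer, trained by regularised maximum likelihood estimation, outputs the best label sequence; decoding that sequence is conventionally done with the Viterbi algorithm, sketched generically below. The emission and transition score matrices are assumed to come from the trained model, and the values in main are placeholders.

#include <cstddef>
#include <limits>
#include <vector>

// emissions[t][y]: score of label y at position t; trans[y][y']: transition score y -> y'.
std::vector<int> viterbiDecode(const std::vector<std::vector<double>>& emissions,
                               const std::vector<std::vector<double>>& trans) {
    const std::size_t T = emissions.size(), L = emissions[0].size();
    std::vector<std::vector<double>> score(T, std::vector<double>(L));
    std::vector<std::vector<int>> back(T, std::vector<int>(L, 0));
    score[0] = emissions[0];
    for (std::size_t t = 1; t < T; ++t)
        for (std::size_t y = 0; y < L; ++y) {
            double best = -std::numeric_limits<double>::infinity();
            for (std::size_t p = 0; p < L; ++p) {
                const double s = score[t - 1][p] + trans[p][y] + emissions[t][y];
                if (s > best) { best = s; back[t][y] = static_cast<int>(p); }
            }
            score[t][y] = best;                                // best score ending in label y
        }
    std::vector<int> path(T);
    double best = -std::numeric_limits<double>::infinity();
    for (std::size_t y = 0; y < L; ++y)                        // best final label
        if (score[T - 1][y] > best) { best = score[T - 1][y]; path[T - 1] = static_cast<int>(y); }
    for (std::size_t t = T - 1; t > 0; --t)                    // backtrack the best sequence
        path[t - 1] = back[t][path[t]];
    return path;
}

int main() {
    std::vector<std::vector<double>> emissions = {{0.1, 2.0, 0.3}, {1.5, 0.2, 0.4}, {0.3, 0.1, 2.2}};
    std::vector<std::vector<double>> trans(3, std::vector<double>(3, 0.0));
    std::vector<int> best = viterbiDecode(emissions, trans);   // best label index per position
    return best.size() == emissions.size() ? 0 : 1;
}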
In a specific implementation, the programming language is VC++, the software development platform is Visual Studio 2015, the operating system is Win10 x64, the CPU is an Intel Core i7 with a main frequency of 3.91 GHz, and the memory is 16 GB. In the model, the word embedding dimension is set to 350, the hidden layer of the GRU network is set to 128 dimensions, the attention mechanism layer is uniformly set to 25 dimensions with D = 15 and i = 3, the dropout layer is set to 0.25, and in the data chunking preprocessing X = 500B and Y = 800B. Ten Map nodes are set in the Map stage, and m = 7 is set in the Reduce stage.
The experiments collected a large number of laboratory test reports, covering pesticide residue and contaminant testing laboratories, judicial appraisal laboratories, drug residue and additive testing laboratories, and more, involving food, medicine, animals and plants, and other areas; the corpus used comes entirely from test reports provided by Shenzhen Customs, 8,300 in total. The entity extraction problem of test reports differs slightly from the recognition of the seven traditional specific entity types such as person names, addresses, and companies; for the extraction of key data from test reports, a complex entity label set has to be defined. The entities that need to be defined comprise seven types: sample number, inspection date, sample item, testing instrument, testing method, sample volume, and report result.
First, the test reports are preprocessed, mainly in three parts: denoising, word segmentation, and word tagging; all tagging work is done manually by professionally trained personnel. The tagging scheme used in the experiment is BIO, where B indicates the beginning of an entity, I indicates the parts of an entity other than the initial position, and O indicates a non-entity. The data are then randomly divided into a training set, a development set, and a test set at a ratio of 3:1:1; the distribution of entity samples in the test reports is shown in Table 1.
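As a purely illustrative example of the BIO scheme described above, the token sequence below shows how one "testing instrument" entity would be tagged; the sentence fragment and the label names are hypothetical and are not taken from the corpus.

#include <string>
#include <utility>
#include <vector>

// Made-up fragment "检测仪器为气相色谱仪" tokenised and BIO-tagged: the entity
// "气相色谱仪" (a testing instrument) starts with B- and continues with I-;
// all other tokens lie outside any entity and receive O.
const std::vector<std::pair<std::string, std::string>> bioExample = {
    {"检测", "O"}, {"仪器", "O"}, {"为", "O"},
    {"气相", "B-INSTRUMENT"}, {"色谱仪", "I-INSTRUMENT"},
};

int main() { return bioExample.size() == 5 ? 0 : 1; }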
Item                Number of characters    Number of entity-tagged characters
Training set        168102                  86249
Development set     82193                   27006
Test set            84730                   28351
Table 1
This method uses three general evaluation indicators for named entity recognition, namely precision (P), recall (R), and the F1 value, to evaluate the performance of the proposed method. The three evaluation indicators are specifically defined as:
P = (number of correctly recognized entities / total number of recognized entities) × 100%
R = (number of correctly recognized entities / total number of entities in the samples) × 100%
F1 = 2 × P × R / (P + R)
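A small C++ sketch of the three indicators follows, computed from counts of correctly recognised entities (TP), spuriously recognised entities (FP), and missed entities (FN); the counts used in main are hypothetical.

#include <iostream>

struct Metrics { double precision, recall, f1; };

Metrics evaluate(int tp, int fp, int fn) {
    const double p = (tp + fp) > 0 ? 100.0 * tp / (tp + fp) : 0.0;  // precision P
    const double r = (tp + fn) > 0 ? 100.0 * tp / (tp + fn) : 0.0;  // recall R
    const double f = (p + r) > 0.0 ? 2.0 * p * r / (p + r) : 0.0;   // F1, harmonic mean of P and R
    return {p, r, f};
}

int main() {
    const Metrics m = evaluate(450, 52, 61);                        // hypothetical counts
    std::cout << "P=" << m.precision << " R=" << m.recall << " F1=" << m.f1 << "\n";
}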
In order to verify the superiority of the improved Block-BAC model in automatic extraction of key data, laboratory test reports were used as samples. On the same data set, the P, R, and F1 values of the different entities in 30 test reports were counted on the different models, and experiments were carried out on the BiGRU-Attention-CRF model (hereinafter the BAC model), the BERT-BAC model, and the Block-BAC model. The P, R, and F1 values in Table 2 are the averages over the 30 test reports, and the comparison results of the model experiments are shown in Table 2.
Model        P       R       F1
BAC          87.31   85.49   86.39
BERT-BAC     92.05   89.13   90.57
Block-BAC    89.75   88.16   88.94
Table 2
It can be seen from the experimental results that, compared with the BAC model, the Block-BAC model proposed by this method clearly improves precision, recall, and the F1 value, with the F1 value increasing by 2.55%; compared with the BERT-BAC model, the F1 value drops by 1.63%, a slight shortfall in entity recognition effect. The comparison of the three models shows that the precision and recall of the Block-BAC model are greatly improved on the basis of BAC; although the improvement is not as large as that of the BERT-BAC model, a high F1 value is still guaranteed and the entity recognition effect is good.
To test whether the model can meet the performance requirements of the requirements analysis, besides the important factor of entity recognition effect, the duration of model training and of entity recognition is also an important performance analysis indicator. Therefore, laboratory test reports were randomly selected; the model training module was tested by continuously increasing the number of training samples, the entity recognition module was tested by increasing the number of recognition samples, and the program running times are plotted in Fig. 6 and Fig. 7, respectively.
From Fig. 6 and Fig. 7 it can be concluded that model training time and entity recognition time increase with the number of samples and are positively correlated. Ordered from longest to shortest, the training time and entity recognition time of the three models are BERT-BAC > BAC > Block-BAC.
In the model training experiment, when the number of samples reached 8,000, the training time of the BERT-BAC model was 14.72 h and the training time of the Block-BAC model was 5.83 h. It can be seen that, compared with the BERT-BAC model, the Block-BAC model shortens training time by 60.36%.
In the entity recognition experiment, when the number of samples reached 180, the entity recognition time of the BERT-BAC model was 85.04 min and that of the Block-BAC model was 51.56 min. It can be seen that, compared with the BERT-BAC model, the Block-BAC model shortens entity recognition time by 39.43%.
The experimental results show that, compared with the BERT-BAC model, the Block-BAC model has an obvious time advantage in model training and entity recognition and better meets users' performance requirements.
In summary, aiming at the shortcomings in the processing efficiency of model training and entity recognition, the Block-BAC model proposed by this method first designs a CDC-based data chunking optimization algorithm for preprocessing and realizes a parallel operation mechanism based on the Hadoop architecture, which becomes an effective strategy for shortening model training and entity recognition time; it then optimizes the local attention mechanism, which reduces the use of ineffective hidden-layer nodes compared with the global attention mechanism; finally, it connects the fully connected layer to obtain the feature vectors of the data blocks and to supplement the internal connections between the data blocks. Compared with the BERT-BAC model, this model shortens model training time by 60.36% and entity recognition time by 39.34%, while taking the F1 value into account, greatly reducing the degree of the benefit trade-off; it has more practical value and meets users' usage requirements.
As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the corresponding description of the method embodiment.
Referring to Fig. 8, a parallelized named entity recognition device based on a distributed system provided by an embodiment of this application is shown,
specifically comprising:
a blocking module 810, configured to acquire data text information and divide the data text information into several text blocks according to named entity recognition rules;
a hidden state acquisition module 820, configured to input the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t;
a feature information generation module 830, configured to take the forward hidden state and reverse hidden state as the current time step, supplement the local features of the text block through the local attention mechanism, and output the feature information of the current time step;
an output feature vector generation module 840, configured to perform a weighted summation of the feature information of the text block through the fully connected layer and generate the output feature vector through the softmax activation function;
an entity recognition result generation module 850, configured to output the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
In an embodiment of the present invention, the blocking module 810 includes:
a sliding window generation submodule, configured to determine the size of the sliding window and slide it from the beginning of the data text information in preset unit steps until the sliding window is fully loaded with data;
a current Hash value determination submodule, configured to determine the current Hash value of the text block formed at the current position of the sliding window;
a first text block generation submodule, configured to, when the current Hash value is greater than the first preset value, use the right boundary of the sliding window as a dividing line to divide the data text information and set the part of the data text information located to the left of the dividing line after division as the text block.
In an embodiment of the present invention, a second preset value is also included, wherein the second preset value is smaller than the first preset value, further including:
a first sliding window driving submodule, configured to drive the sliding window to slide forward by one preset unit step when the current Hash value is smaller than the second preset value.
Further, it also includes:
a historical Hash value acquisition submodule, configured to obtain, when the current Hash value lies between the first preset value and the second preset value, the first historical Hash value most adjacent to the current Hash value and the next-adjacent second historical Hash value;
a judging submodule, configured to judge whether the current Hash value is equal to the first historical Hash value and the second historical Hash value;
a second text block generation submodule, configured to, if so, use the right boundary of the sliding window as a dividing line to divide the data text information and set the part of the data text information located to the left of the dividing line after division as the text block; and a second sliding window driving submodule, configured to, if not, drive the sliding window to slide forward by one preset unit step.
Further, it also includes:
a division position judging submodule, configured to judge whether the sliding window has reached the end of the data text information;
a third sliding window driving submodule, configured to, if not, drive the sliding window to slide forward by one preset unit step.
Further, the hidden state acquisition module 820 includes:
a neural network construction submodule, configured to construct two independent GRU neural networks;
a hidden state generation submodule, configured to summarize and merge the forward information and backward information of the text block; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
Further, the feature information generation module 830 includes:
a context vector determination submodule, configured to take the forward hidden state and reverse hidden state as the current time step and determine the forward context vector and the reverse context vector using a local attention mechanism with a fixed window size;
a feature information generation submodule, configured to generate the feature information using the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector as training conditions, wherein the feature information includes forward feature information and reverse feature information.
Further, the output feature vector generation module 840 includes:
a joint learning submodule, configured to input the feature information and a random initial vector into the pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector;
a weight vector determination submodule, configured to determine the weight vector according to the hidden layer representation and the upper-lower-layer hidden state association vector;
a weighted summation submodule, configured to weight and sum the feature information according to the weight vector to obtain a summation result;
an output feature vector generation submodule, configured to generate the output feature vector from the summation result through the softmax activation function.
Further, the joint learning submodule includes:
obtaining the hidden layer representation by the following formula:
let the forward feature information output through local attention in the i-th text block be hf_ij and the reverse feature information be hb_ij, and let S_ij = [hf_ij; hb_ij];
u_ij = tanh(W_α·S_ij + b_α)
where i represents the i-th text block, j represents the j-th time step, u_ij represents the hidden layer representation output after S_ij passes through the single-layer perceptron network, and S_ij represents S_i at the j-th time step.
Although preferred embodiments of the embodiments of this application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they know the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of this application.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another entity or operation and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The parallelized named entity recognition method based on a distributed system provided by this application has been introduced in detail above. Specific examples are used in this document to explain the principles and implementation of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application according to the idea of this application. In summary, the contents of this specification should not be construed as limiting this application.

Claims (10)

  1. A parallelized named entity recognition method based on a distributed system, characterized in that the method comprises:
    acquiring data text information, and dividing the data text information into several text blocks according to named entity recognition rules;
    inputting the word embedding vector obtained by each text block through the embedding layer into a BiGRU neural network to mine the global features of the text block, and obtaining the forward hidden state and reverse hidden state of the text block at time t;
    taking the forward hidden state and the reverse hidden state as the current time step, supplementing the local features of the text block through a local attention mechanism, and outputting the feature information of the current time step;
    performing a weighted summation of the feature information of the text block through a fully connected layer, and generating an output feature vector through the softmax activation function;
    outputting the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
  2. The method according to claim 1, characterized in that the step of acquiring data text information and dividing the data text information into several text blocks according to named entity recognition rules comprises:
    determining the size of a sliding window, and sliding it from the beginning of the data text information in preset unit steps until the sliding window is fully loaded with data;
    determining the current Hash value of the text block formed at the current position of the sliding window;
    when the current Hash value is greater than a first preset value, using the right boundary of the sliding window as a dividing line to divide the data text information, and setting the part of the data text information located to the left of the dividing line after division as the text block.
  3. The method according to claim 2, characterized in that a second preset value is also included, wherein the second preset value is smaller than the first preset value, and the method further comprises the step of:
    when the current Hash value is smaller than the second preset value, driving the sliding window to slide forward by one preset unit step.
  4. The method according to claim 3, characterized by further comprising the steps of:
    when the current Hash value lies between the first preset value and the second preset value, obtaining the first historical Hash value most adjacent to the current Hash value and the next-adjacent second historical Hash value;
    judging whether the current Hash value is equal to the first historical Hash value and the second historical Hash value;
    if so, using the right boundary of the sliding window as a dividing line to divide the data text information, and setting the part of the data text information located to the left of the dividing line after division as the text block; if not, driving the sliding window to slide forward by one preset unit step.
  5. The method according to claim 2, characterized by further comprising the steps of:
    judging whether the sliding window has reached the end of the data text information;
    if not, driving the sliding window to slide forward by one preset unit step.
  6. The method according to claim 2, characterized in that the step of inputting the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t comprises:
    constructing two independent GRU neural networks;
    summarizing and merging the forward information and backward information of the text block; specifically, the GRU neural network corresponding to the forward direction generates the forward hidden state from the forward information and the embedded input words, and the GRU neural network corresponding to the backward direction generates the reverse hidden state from the backward information and the embedded input words.
  7. The method according to claim 6, characterized in that the step of taking the forward hidden state and reverse hidden state as the current time step, supplementing the local features of the text block through the local attention mechanism, and outputting the feature information of the current time step comprises:
    taking the forward hidden state and reverse hidden state as the current time step, and determining the forward context vector and the reverse context vector using a local attention mechanism with a fixed window size;
    generating the feature information using the forward hidden state, the reverse hidden state, the forward context vector, and the reverse context vector as training conditions, wherein the feature information includes forward feature information and reverse feature information.
  8. The method according to claim 7, characterized in that the step of performing a weighted summation of the feature information of the text block through the fully connected layer and generating an output feature vector through the softmax activation function comprises:
    inputting the feature information and a random initial vector into a pre-built single-layer perceptron network for joint learning to generate a hidden layer representation and an upper-lower-layer hidden state association vector;
    determining a weight vector according to the hidden layer representation and the upper-lower-layer hidden state association vector;
    performing a weighted summation of the feature information according to the weight vector to obtain a summation result;
    generating the output feature vector from the summation result through the softmax activation function.
  9. The method according to claim 8, characterized in that the step of inputting the feature information and a random initial vector into the pre-built single-layer perceptron network for joint learning to generate the hidden layer representation and the upper-lower-layer hidden state association vector comprises:
    obtaining the hidden layer representation by the following formula:
    let the forward feature information output through local attention in the i-th text block be hf_ij and the reverse feature information be hb_ij, and let S_ij = [hf_ij; hb_ij];
    u_ij = tanh(W_α·S_ij + b_α)
    where i represents the i-th text block, j represents the j-th time step, u_ij represents the hidden layer representation output after S_ij passes through the single-layer perceptron network, and S_ij represents S_i at the j-th time step.
  10. A parallelized named entity recognition device based on a distributed system, characterized in that it specifically comprises:
    a blocking module, configured to acquire data text information and divide the data text information into several text blocks according to named entity recognition rules;
    a reverse hidden state acquisition module, configured to input the word embedding vector obtained by each text block through the embedding layer into the BiGRU neural network to mine the global features of the text block and obtain the forward hidden state and reverse hidden state of the text block at time t;
    a feature information generation module, configured to take the forward hidden state and reverse hidden state as the current time step, supplement the local features of the text block through the local attention mechanism, and output the feature information of the current time step;
    an output feature vector generation module, configured to perform a weighted summation of the feature information of the text block through the fully connected layer and generate an output feature vector through the softmax activation function;
    an entity recognition result generation module, configured to output the best sequence as the entity recognition result through the regularized maximum likelihood estimation of the CRF model.
PCT/CN2021/108313 2021-07-26 2021-07-26 Parallelized named entity recognition method and device based on a distributed system WO2023004528A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/108313 WO2023004528A1 (zh) 2021-07-26 2021-07-26 Parallelized named entity recognition method and device based on a distributed system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/108313 WO2023004528A1 (zh) 2021-07-26 2021-07-26 Parallelized named entity recognition method and device based on a distributed system

Publications (1)

Publication Number Publication Date
WO2023004528A1 true WO2023004528A1 (zh) 2023-02-02

Family

ID=85086060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/108313 WO2023004528A1 (zh) 2021-07-26 2021-07-26 一种基于分布式***的并行化命名实体识别方法及装置

Country Status (1)

Country Link
WO (1) WO2023004528A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108206A (zh) * 2023-04-13 2023-05-12 中南大学 Joint extraction method for financial data entity relations and related device
CN116629264A (zh) * 2023-05-24 2023-08-22 成都信息工程大学 Relation extraction method based on multiple word embeddings and a multi-head self-attention mechanism
CN116703128A (zh) * 2023-08-07 2023-09-05 国网信息通信产业集团有限公司 Natural language processing method suitable for power dispatching
CN117669574A (zh) * 2024-02-01 2024-03-08 浙江大学 Entity recognition method and system for the artificial intelligence field based on multi-semantic feature fusion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101788976A (zh) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 Content-based file segmentation method
CN109344391A (zh) * 2018-08-23 2019-02-15 昆明理工大学 Multi-feature fusion Chinese news text summarization method based on neural networks
CN109657239A (zh) * 2018-12-12 2019-04-19 电子科技大学 Chinese named entity recognition method based on attention mechanism and language model learning
US20190171913A1 (en) * 2017-12-04 2019-06-06 Slice Technologies, Inc. Hierarchical classification using neural networks
CN112328555A (zh) * 2020-11-25 2021-02-05 国网重庆招标有限公司 Rapid generation method for bidding documents
CN112348075A (zh) * 2020-11-02 2021-02-09 大连理工大学 Multimodal emotion recognition method based on a contextual attention neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101788976A (zh) * 2010-02-10 2010-07-28 北京播思软件技术有限公司 Content-based file segmentation method
US20190171913A1 (en) * 2017-12-04 2019-06-06 Slice Technologies, Inc. Hierarchical classification using neural networks
CN109344391A (zh) * 2018-08-23 2019-02-15 昆明理工大学 Multi-feature fusion Chinese news text summarization method based on neural networks
CN109657239A (zh) * 2018-12-12 2019-04-19 电子科技大学 Chinese named entity recognition method based on attention mechanism and language model learning
CN112348075A (zh) * 2020-11-02 2021-02-09 大连理工大学 Multimodal emotion recognition method based on a contextual attention neural network
CN112328555A (zh) * 2020-11-25 2021-02-05 国网重庆招标有限公司 Rapid generation method for bidding documents

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108206A (zh) * 2023-04-13 2023-05-12 中南大学 Joint extraction method for financial data entity relations and related device
CN116629264A (zh) * 2023-05-24 2023-08-22 成都信息工程大学 Relation extraction method based on multiple word embeddings and a multi-head self-attention mechanism
CN116629264B (zh) * 2023-05-24 2024-01-23 成都信息工程大学 Relation extraction method based on multiple word embeddings and a multi-head self-attention mechanism
CN116703128A (zh) * 2023-08-07 2023-09-05 国网信息通信产业集团有限公司 Natural language processing method suitable for power dispatching
CN116703128B (zh) * 2023-08-07 2024-01-02 国网信息通信产业集团有限公司 Natural language processing method suitable for power dispatching
CN117669574A (zh) * 2024-02-01 2024-03-08 浙江大学 Entity recognition method and system for the artificial intelligence field based on multi-semantic feature fusion
CN117669574B (zh) * 2024-02-01 2024-05-17 浙江大学 Entity recognition method and system for the artificial intelligence field based on multi-semantic feature fusion

Similar Documents

Publication Publication Date Title
WO2023004528A1 (zh) Parallelized named entity recognition method and device based on a distributed system
CN110347837B (zh) Unplanned readmission risk prediction method for cardiovascular disease
CN111738003B (zh) Named entity recognition model training method, named entity recognition method, and medium
CN104699763B (zh) Multi-feature fusion text similarity measurement system
CN113591483A (zh) Document-level event argument extraction method based on sequence labeling
CN108874896B (zh) Humor recognition method based on neural networks and humor features
Wahid et al. Topic2Labels: A framework to annotate and classify the social media data through LDA topics and deep learning models for crisis response
Zhou et al. Sentiment analysis of text based on CNN and bi-directional LSTM model
CN112052684A (zh) Named entity recognition method, apparatus, device, and storage medium for electric power metering
CN106202065B (zh) Cross-language topic detection method and system
Sartakhti et al. Persian language model based on BiLSTM model on COVID-19 corpus
Jin et al. Inter-sentence and implicit causality extraction from chinese corpus
CN116304748A (zh) Text similarity calculation method, system, device, and medium
Guan et al. Hierarchical neural network for online news popularity prediction
CN111723572B (zh) Chinese short text relevance measurement method based on CNN convolutional layers and BiLSTM
CN111859955A (zh) Public opinion data analysis model based on deep learning
Zhang et al. Combining the attention network and semantic representation for Chinese verb metaphor identification
Nazarizadeh et al. Using Group Deep Learning and Data Augmentation in Persian Sentiment Analysis
CN113342964B (zh) Recommendation type determination method and system based on mobile services
CN116263786A (zh) Public opinion text sentiment analysis method, apparatus, computer device, and medium
CN115600595A (zh) Entity relation extraction method, system, device, and readable storage medium
CN113051886B (zh) Test question duplicate checking method, apparatus, storage medium, and device
CN115510230A (zh) Mongolian sentiment analysis method based on multi-dimensional feature fusion and a comparative enhanced learning mechanism
Dong et al. Knowledge graph construction of high-performance computing learning platform
Ji et al. Research on semantic similarity calculation methods in Chinese financial intelligent customer service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951147

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE