CN112883161A

CN112883161A - Transliteration name recognition rule generation method, transliteration name recognition rule generation device, transliteration name recognition rule generation equipment and storage medium

Info

Publication number: CN112883161A
Application number: CN202110242748.2A
Authority: CN
Inventors: 聂镭; 齐凯杰; 聂颖
Original assignee: Longma Zhixin Zhuhai Hengqin Technology Co ltd
Current assignee: Longma Zhixin Zhuhai Hengqin Technology Co ltd
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2021-06-01

Abstract

The embodiments of the present application belong to the field of natural language processing and provide a method for generating a transliteration name recognition rule, comprising: acquiring a text sample to be processed, wherein the text sample to be processed is marked with a first transliteration name text; extracting the first transliteration name text; and generating a transliteration name recognition rule according to the first transliteration name text, wherein the transliteration name recognition rule is used for recognizing a second transliteration name text in a text to be processed. The transliteration name recognition rule is thus generated from training sample data, and transliterated foreign names in Chinese text are recognized according to the rule, which improves the accuracy of current Chinese named entity recognition.

Description

Transliteration name recognition rule generation method, transliteration name recognition rule generation device, transliteration name recognition rule generation equipment and storage medium

Technical Field

The present application relates to the field of natural language processing, and in particular, to a method and an apparatus for generating transliterated name recognition rules, a generation device, and a storage medium.

Background

At the present stage, Chinese named entity recognition mainly covers entity types such as person names, place names and organizations. Relatively mature means exist for recognizing Chinese person names, but the recognition results are relatively poor when the Chinese text contains transliterated foreign names. Transliterated foreign names vary in length and have no obvious boundary marker words, so current Chinese named entity recognition cannot identify foreign names that contain transliterations, which leads to low recognition accuracy.

Disclosure of Invention

In view of this, the embodiments of the present application provide a method, an apparatus, a device and a storage medium for generating a transliteration name recognition rule, so as to solve the problem that current Chinese named entity recognition cannot recognize foreign names containing transliterations, which results in low recognition accuracy.

A first aspect of an embodiment of the present application provides a method for generating a transliteration name recognition rule, including:

acquiring a text sample to be processed, wherein the text sample to be processed is marked with a first transliteration name text;

extracting the first transliteration name text;

and generating the transliteration name recognition rule according to the first transliteration name text, wherein the transliteration name recognition rule is used for recognizing a second transliteration name text in the text to be processed.

In a possible implementation manner of the first aspect, the generating the transliteration name recognition rule according to the transliteration name text includes:

constructing a transliteration name database according to the first transliteration name text;

determining the frequency corresponding to each word in the first transliteration name text in the transliteration name database;

and generating a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text.

In a possible implementation manner of the first aspect, constructing a transliteration name recognition rule according to a frequency corresponding to each word in the first transliteration name text includes:

obtaining a frequency threshold according to the frequency corresponding to each word in the first transliteration name text;

the transliteration name recognition rule is obtained according to the following formula:

$$\min(p_i) > P_{\text{threshold}}, \quad i = 1, 2, \ldots, n$$

where $\min(p_i)$ denotes the minimum frequency of the words in a phrase of the text to be processed and $P_{\text{threshold}}$ denotes the frequency threshold; when the minimum frequency of the words in a phrase of the text to be processed is greater than the frequency threshold, the phrase is determined to be a second transliteration name text.

In a possible implementation manner of the first aspect, before extracting the first transliteration name text, the method further includes:

dividing the text sample to be processed into a first text sample and a second text sample, wherein the first text sample contains a name of a person, and the second text sample does not contain the name of the person;

inputting the first text sample into a preset classification model for training to obtain a name recognition model;

wherein the name recognition model serves as a pre-screening rule corresponding to the transliteration name recognition rule: before the transliteration name recognition rule recognizes a second transliteration name text in the text to be processed, the model is used to screen the text to be processed for person names.

In a possible implementation manner of the first aspect, after generating the transliteration name recognition rule according to the first transliteration name text, the method further includes:

extracting left adjacent characters and right adjacent characters of the first transliteration name from the text sample to be processed;

forming a prefix corpus according to the left adjacent characters;

forming a suffix corpus according to the right adjacent characters;

and generating a check rule corresponding to the transliteration name identification rule based on the prefix corpus and the suffix corpus, wherein the check rule is used for checking a second transliteration name text in the text to be processed after the transliteration name identification mechanism identifies the second transliteration name text.

A second aspect of the embodiments of the present application provides an apparatus for generating a transliteration name recognition rule, including:

an acquisition module, configured to acquire a text sample to be processed, wherein the text sample to be processed is marked with a first transliteration name text;

the extraction module is used for extracting the first transliteration name text;

and the generating module is used for generating the transliteration name recognition rule according to the first transliteration name text, and the transliteration name recognition rule is used for recognizing a second transliteration name text in the text to be processed.

In one possible implementation, the generating module includes:

the construction unit is used for constructing a transliteration name database according to the first transliteration name text;

the determining unit is used for determining the frequency corresponding to each word in the first transliteration name text in the transliteration name database;

and the generating unit is used for generating a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text.

In one possible implementation, the generating unit includes:

the calculation subunit is used for obtaining a frequency threshold value according to the frequency corresponding to each word in the first transliteration name text;

a rule generating subunit, configured to obtain the transliteration name identification rule according to the following formula:

$$\min(p_i) > P_{\text{threshold}}, \quad i = 1, 2, \ldots, n$$

In one possible implementation, the apparatus further includes:

the dividing module is used for dividing the text sample to be processed into a first text sample and a second text sample, wherein the first text sample contains a name of a person, and the second text sample does not contain the name of the person;

the training module is used for inputting the first text sample into a preset classification model for training to obtain a name recognition model;

In one possible implementation, the apparatus further includes:

the extraction module is used for extracting left adjacent characters and right adjacent characters of the first transliteration name from the text sample to be processed;

the first forming submodule is used for forming a prefix corpus according to the left adjacent characters;

the second forming submodule is used for forming a suffix corpus according to the right adjacent characters;

and the verification generation submodule is used for generating a verification rule corresponding to the transliteration name identification rule based on the prefix corpus and the suffix corpus, and the verification rule is used for verifying a second transliteration name text in the text to be processed after the transliteration name identification mechanism identifies the second transliteration name text.

A third aspect of an embodiment of the present application provides a generation apparatus, including: a memory, a processor, an image capture device, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method as described above when executing the computer program.

A fourth aspect of embodiments of the present application provides a readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the method as described above.

Compared with the prior art, the embodiment of the application has the advantages that:

according to the method and the device, the transliteration name recognition rule is generated through training sample data, so that the names of foreigners containing transliterations in the Chinese text are recognized according to the transliteration name recognition rule, and the recognition accuracy rate of the named entity recognition of Chinese at the present stage is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.

Fig. 1 is a schematic flowchart of a method for generating a transliteration name recognition rule according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a transliteration name recognition rule generating device according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a generating device provided in an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.

Referring to Fig. 1, which is a schematic flowchart of a method for generating a transliteration name recognition rule provided in an embodiment of the present application, the method is applied to a generating device. The generating device includes a server or a terminal device; the server may be a computing device such as a cloud server, and the terminal device may be a computing device such as a desktop computer, a notebook computer or a palmtop computer. The method includes the following steps:

Step S101, obtaining a text sample to be processed.

The text sample to be processed is marked with a first transliteration name text.

It can be understood that, according to the transliteration name type to be identified, the embodiment of the present application collects corresponding corpora, where the corpora include a large number of text samples to be processed, and each text sample to be processed includes a labeled transliteration name text.

Step S102, extracting the first transliteration name text.

It can be understood that the transliteration name text marked out in the text sample to be processed is directly extracted.
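The application does not state how the first transliteration name text is marked in a sample. As a minimal sketch, assuming a hypothetical inline markup in which each annotated name is wrapped in double square brackets, the extraction step might look like the following. The marker syntax, the function name `extract_marked_names` and the sample sentence are illustrative assumptions, not part of the application.

```python
import re

# Hypothetical annotation format: [[name]] marks a transliterated name.
MARKED_NAME = re.compile(r"\[\[(.+?)\]\]")


def extract_marked_names(sample):
    """Return every annotated name in one text sample, in order of appearance."""
    return MARKED_NAME.findall(sample)


# Illustrative sample, loosely based on the example sentence used later in the text.
sample = "Ohio Governor [[Mike·Dewen]] said the alert level would be maintained."
names = extract_marked_names(sample)
print(names)
```

Any other annotation scheme (for example character offsets stored next to the text) would serve equally well; only the list of extracted names matters for the following steps.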

Step S103, generating a transliteration name recognition rule according to the first transliteration name text, wherein the transliteration name recognition rule is used for recognizing a second transliteration name text in the text to be processed.

In a specific application, generating a transliteration name recognition rule according to a first transliteration name text comprises the following steps:

First, a transliteration name database is constructed according to the first transliteration name text.

Second, the frequency corresponding to each word in the first transliteration name text in the transliteration name database is determined.

Third, a transliteration name recognition rule is generated according to the frequency corresponding to each word in the first transliteration name text.

It can be understood that a library of characters and words commonly used in names is constructed from the transliteration name corpus of step S101. First, special characters in the transliteration names are removed. For example, an English transliterated name may contain a connector such as "·" or "-", and a Burmese transliterated name may contain a word representing a title, such as 'Du' or 'Wu'; the special characters to be removed must therefore be defined separately for each kind of transliterated name. Second, the characters appearing in the transliteration names are counted. Then, the common words in the transliteration names, such as 'Mike' and 'Michael', are counted. Finally, the counted characters and words form the library of common name words.

Names of different countries follow different naming rules. For example, Burmese names consist of a given name only, without a surname; a name has at least one character and can be as long as six or seven characters. A title is placed before a Burmese name, such as 'Du' or 'Ma' for women and 'Guo' or 'Wu' for men, and the title varies with age and social status, so these rules need to be considered when recognizing Burmese names. When an English name is transliterated, a connector such as "·" or "-" may appear in the middle of the name, so the influence of such special characters needs to be considered when recognizing transliterated English names. Corresponding recognition rules are established according to the name transliteration characteristics of each country; although these rules cannot recognize every name, they recognize part of the valid names and reduce the error rate.
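The library construction described above can be sketched as follows. This is only an illustration under stated assumptions: the connector set, the `title_prefixes` parameter, the use of relative frequency as "the frequency corresponding to each word", and the sample names are all hypothetical choices that the application does not fix.

```python
from collections import Counter

# Assumed connector characters inside transliterated names.
CONNECTORS = ("·", "-")


def build_name_database(names, title_prefixes=()):
    """Count characters and name parts over the annotated names.

    title_prefixes is a hypothetical, per-name-type collection of title
    words that are stripped from the front of a name before counting.
    """
    char_counts, part_counts = Counter(), Counter()
    for name in names:
        for title in title_prefixes:
            if name.startswith(title) and len(name) > len(title):
                name = name[len(title):]
                break
        # Turn every connector into a space, then split the name into parts.
        for connector in CONNECTORS:
            name = name.replace(connector, " ")
        parts = name.split()
        part_counts.update(parts)
        for part in parts:
            char_counts.update(part)  # counts each character of the part
    return char_counts, part_counts


def relative_frequencies(char_counts):
    """Convert raw character counts into relative frequencies."""
    total = sum(char_counts.values())
    return {ch: count / total for ch, count in char_counts.items()}


# Illustrative sample names (not taken from the application's corpus).
names = ["迈克·德温", "迈克尔·乔丹", "迈克-彭斯"]
char_counts, part_counts = build_name_database(names)
freqs = relative_frequencies(char_counts)
print(part_counts.most_common(1))
print(round(freqs["迈"], 3))
```

Keeping the title words and connectors as parameters reflects the point made in the text that special characters have to be defined per kind of transliterated name.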

Illustratively, constructing a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text comprises the following steps:

1. obtaining a frequency threshold according to the frequency corresponding to each word in the first transliteration name text;

2. the transliteration name recognition rule is obtained according to the following formula:

$$\min(p_i) > P_{\text{threshold}}, \quad i = 1, 2, \ldots, n$$

where $\min(p_i)$ represents the minimum frequency of the words in a phrase of the text to be processed and $P_{\text{threshold}}$ represents the frequency threshold; when the minimum frequency of the words in a phrase of the text to be processed is greater than the frequency threshold, the phrase is determined to be a second transliteration name text.

It can be understood that the embodiment of the present application recognizes transliteration names by using the degree of cohesion of a phrase: when the minimum frequency of the words in a phrase of the text to be processed is greater than the frequency threshold, the phrase is taken to be a transliteration name text.
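A minimal sketch of applying this rule is given below. The application does not say how the frequency threshold is derived from the frequencies, so the threshold is simply passed in; the frequency table, the threshold value and the treatment of unseen characters as frequency 0 are assumptions made for illustration.

```python
def min_char_frequency(phrase, freqs, ignore=("·", "-")):
    """Smallest frequency among the characters of a phrase.

    Characters missing from the name database are given frequency 0.0
    (an assumption; the application does not cover unseen characters).
    """
    chars = [ch for ch in phrase if ch not in ignore]
    if not chars:
        return 0.0
    return min(freqs.get(ch, 0.0) for ch in chars)


def is_transliterated_name(phrase, freqs, threshold):
    """Rule from the text: the phrase is a name when min(p_i) > P_threshold."""
    return min_char_frequency(phrase, freqs) > threshold


# Hand-picked illustrative frequencies and threshold.
freqs = {"迈": 0.3, "克": 0.3, "德": 0.2, "温": 0.2}
threshold = 0.1

print(is_transliterated_name("迈克·德温", freqs, threshold))  # True
print(is_transliterated_name("迈克城", freqs, threshold))     # False
```

Because the rule uses the minimum, a single character that rarely or never appears in known transliterated names is enough to reject a candidate phrase.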

In an optional implementation manner, before extracting the first transliteration name text, the method further includes:

First, the text sample to be processed is divided into a first text sample and a second text sample, wherein the first text sample contains a person's name and the second text sample does not contain a person's name.

Second, the first text sample is input into a preset classification model for training to obtain a name recognition model.

It can be understood that, according to the corpus, the text samples to be processed can be divided into two types: those containing a person's name and those not containing one. The name recognition model is mainly used to reduce ambiguity and false extraction of transliterated names in sentences. For example, the transliterated name 'Bell' denotes a person in the sentence 'Bell has said xxx' but does not denote a person in the sentence 'Bell City is a beautiful city', so whether a word denotes a person's name depends heavily on the context of the sentence. A classification model is therefore introduced to judge whether a sentence contains a person's name before further transliteration name recognition is carried out.

Common classification models, such as an SVM, a maximum entropy model or a Bayesian model, can all be adopted. A classification model that judges whether a person's name is contained is established from the corresponding corpus.
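As one possible illustration of such a pre-screening model, the sketch below implements a small character-level naive Bayes classifier (one of the Bayesian options mentioned above) using only the standard library. It is trained on sentences of both types, which is an assumption about how the training is set up; the class name `NaiveBayesNameFilter`, the add-one smoothing and the four training sentences are likewise hypothetical.

```python
import math
from collections import Counter


class NaiveBayesNameFilter:
    """Character-level multinomial naive Bayes with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.char_counts = {label: Counter() for label in self.class_counts}
        for sentence, label in zip(sentences, labels):
            self.char_counts[label].update(sentence)
        # Vocabulary: every character seen in any training sentence.
        self.vocab = set().union(*self.char_counts.values())
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            denom = sum(self.char_counts[label].values()) + len(self.vocab)
            score = math.log(n_docs / total)
            for ch in sentence:
                if ch in self.vocab:  # skip characters never seen in training
                    score += math.log((self.char_counts[label][ch] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Illustrative training data: True = sentence contains a person's name.
sentences = [
    "贝尔说过你好",            # "Bell has said hello"
    "迈克说过可以",            # "Mike has said yes"
    "贝尔市是一座美丽的城市",  # "Bell City is a beautiful city"
    "巴黎是一座美丽的城市",    # "Paris is a beautiful city"
]
labels = [True, True, False, False]
model = NaiveBayesNameFilter().fit(sentences, labels)

print(model.predict("贝尔说过不行"))      # True
print(model.predict("贝尔市是一座城市"))  # False
```

Only sentences for which the filter predicts `True` would then be passed on to the frequency-based recognition rule. A real system would need far more training data than this toy set.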

In another optional implementation manner, after generating the transliteration name recognition rule according to the first transliteration name text, the method further includes:

First, the left adjacent characters and right adjacent characters of the first transliteration name are extracted from the text sample to be processed.

Second, a prefix corpus is formed according to the left adjacent characters.

Third, a suffix corpus is formed according to the right adjacent characters.

It can be understood that transliteration names are collected manually from the data in the corpus, and the words or characters immediately before and after each transliteration name are obtained through word segmentation. Taking an English transliterated name as an example: "Ohio Governor Mike·Dewen said that law enforcement agencies continue to worry about possible violent incidents in the next few days, and will continue to maintain the state party building alert level." Here "Mike·Dewen" is the name of a person, the word preceding it is "Governor", and the word following it is "said". A large number of transliteration names are collected from the corpus data to form a transliteration name corpus; a prefix corpus is formed from the words or characters before the transliteration names; and a suffix corpus is formed from the words or characters after the transliteration names.
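The collection of left and right neighbours can be sketched as below, assuming each sample has already been segmented into tokens and the position of the annotated name is known. The input shape `(tokens, name_index)` and the function name `collect_neighbors` are assumptions; the application only says that neighbours are obtained through word segmentation.

```python
from collections import Counter


def collect_neighbors(samples):
    """Build prefix and suffix corpora from segmented, annotated samples.

    samples: iterable of (tokens, name_index) pairs, where tokens is a list
    of words and name_index is the position of the transliterated name.
    """
    prefix_corpus, suffix_corpus = Counter(), Counter()
    for tokens, name_index in samples:
        if name_index > 0:  # the name is not the first token
            prefix_corpus[tokens[name_index - 1]] += 1
        if name_index + 1 < len(tokens):  # the name is not the last token
            suffix_corpus[tokens[name_index + 1]] += 1
    return prefix_corpus, suffix_corpus


samples = [
    # Segmented form of the example sentence from the text (truncated).
    (["Ohio", "Governor", "Mike·Dewen", "said", "that", "law", "enforcement"], 2),
    # A second, made-up sample in which the name opens the sentence.
    (["Mike·Dewen", "said", "yes"], 0),
]
prefix_corpus, suffix_corpus = collect_neighbors(samples)
print(prefix_corpus)
print(suffix_corpus)
```

Storing counts rather than plain sets keeps the option of later ignoring neighbours that were seen only once.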

Fourth, a check rule corresponding to the transliteration name recognition rule is generated based on the prefix corpus and the suffix corpus, wherein the check rule is used for checking the second transliteration name text after the transliteration name recognition rule recognizes the second transliteration name text in the text to be processed.

It will be appreciated that the sets of left neighbors and right neighbors of a candidate transliteration name can be computed from the prefix corpus and the suffix corpus. The part of speech of the word preceding a candidate transliteration name may be a noun, a verb and the like, but not an adverb; the part of speech of the word following it may be a verb, an auxiliary word and the like. The parts of speech of the entries in the prefix corpus and the suffix corpus are analyzed; because part of speech depends on context, this analysis should be carried out at the same time as the first step above, when the adjacent characters are extracted from the text sample. Whether the words before and after a transliteration name recognized in the current sentence meet the requirements is then judged against the parts of speech in the prefix corpus and the suffix corpus: if the conditions are met, the phrase is judged to be a transliteration name; if not, it is judged not to be a transliteration name.
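A minimal sketch of such a check rule follows. It assumes that every corpus entry already carries a part-of-speech tag produced by some tagger at extraction time; the tag names, the `(word, pos)` entry shape and the function name `build_check_rule` are hypothetical.

```python
def build_check_rule(prefix_corpus, suffix_corpus):
    """Derive a part-of-speech check from tagged prefix and suffix corpora.

    Each corpus is an iterable of (word, pos) pairs observed next to
    annotated transliterated names.
    """
    allowed_left = {pos for _word, pos in prefix_corpus}
    allowed_right = {pos for _word, pos in suffix_corpus}

    def check(left_pos, right_pos):
        # Accept a candidate only if both neighbours have a part of speech
        # that was observed around known transliterated names.
        return left_pos in allowed_left and right_pos in allowed_right

    return check


# Illustrative tagged corpora.
prefix_corpus = [("Governor", "noun"), ("visited", "verb")]
suffix_corpus = [("said", "verb"), ("will", "auxiliary")]
check = build_check_rule(prefix_corpus, suffix_corpus)

print(check("noun", "verb"))    # True
print(check("adverb", "verb"))  # False
```

This sketch rejects candidates at a sentence boundary, where one neighbour is missing; whether such candidates should pass is a design decision the application leaves open.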

In the embodiment of the application, the transliteration name recognition rule is generated through training sample data, so that the names of foreigners containing transliterations in the Chinese text are recognized according to the transliteration name recognition rule, and the recognition accuracy rate of the named entity recognition of Chinese at the present stage is improved.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

The following describes a transliteration name recognition rule generation apparatus provided in an embodiment of the present application.

The transliteration name recognition rule generation device of the present embodiment corresponds to the transliteration name recognition rule generation method described above.

Fig. 2 is a schematic structural diagram of an apparatus for generating a transliteration name recognition rule according to an embodiment of the present application, where the apparatus may be specifically integrated in a generating device, and the apparatus may include:

the obtaining module 21 is configured to obtain a text sample to be processed, where the text sample to be processed is marked with a first transliteration name text;

an extracting module 22, configured to extract the first transliteration name text;

the generating module 23 is configured to generate the transliteration name recognition rule according to the first transliteration name text, where the transliteration name recognition rule is used to recognize a second transliteration name text in the text to be processed.

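In one possible implementation, the generating module includes:

the construction unit is used for constructing a transliteration name database according to the first transliteration name text;

the determining unit is used for determining the frequency corresponding to each word in the first transliteration name text in the transliteration name database;

and the generating unit is used for generating a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text.

In one possible implementation, the generating unit includes:

the calculation subunit is used for obtaining a frequency threshold value according to the frequency corresponding to each word in the first transliteration name text;

a rule generating subunit, configured to obtain the transliteration name recognition rule according to the following formula:

$$\min(p_i) > P_{\text{threshold}}, \quad i = 1, 2, \ldots, n$$

In one possible implementation, the apparatus further includes:

the dividing module is used for dividing the text sample to be processed into a first text sample and a second text sample, wherein the first text sample contains a name of a person, and the second text sample does not contain the name of the person;

the training module is used for inputting the first text sample into a preset classification model for training to obtain a name recognition model.

In one possible implementation, the apparatus further includes:

the extraction module is used for extracting left adjacent characters and right adjacent characters of the first transliteration name from the text sample to be processed;

the first forming submodule is used for forming a prefix corpus according to the left adjacent characters;

the second forming submodule is used for forming a suffix corpus according to the right adjacent characters;

and the verification generation submodule is used for generating a check rule corresponding to the transliteration name recognition rule based on the prefix corpus and the suffix corpus, and the check rule is used for checking a second transliteration name text in the text to be processed after the transliteration name recognition rule recognizes the second transliteration name text.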

It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.

Fig. 3 is a schematic diagram of a generating device 3 provided in an embodiment of the present application. As shown in fig. 3, the generation device 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in said memory 31 and executable on said processor 30. The steps in the above-described method embodiments are implemented when the computer program 32 is executed by the processor 30. Alternatively, the processor 30 implements the functions of the modules/units in the above-described device embodiments when executing the computer program 32.

Illustratively, the computer program 32 may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 30 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 32 in the generating device 3.

The generating device 3 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The generating device 3 may include, but is not limited to, a processor 30 and a memory 31. It will be appreciated by those skilled in the art that Fig. 3 is only an example of the generating device 3 and does not constitute a limitation of the generating device 3, which may comprise more or fewer components than those shown, combine some components, or use different components; for example, the generating device 3 may further comprise an input-output device, a network access device, a bus, etc.

The processor 30 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 31 may be an internal storage unit of the generating device 3, such as a hard disk or a memory of the generating device 3. The memory 31 may also be an external storage device of the generating device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the generating device 3. Further, the memory 31 may also include both an internal storage unit of the generating device 3 and an external storage device. The memory 31 is used for storing the computer program and other programs and data required by the generating device 3. The memory 31 may also be used to temporarily store data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other ways. For example, the above-described terminal device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical function division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above can be realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

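1. A method for generating a transliteration name recognition rule, characterized by comprising the following steps:

acquiring a text sample to be processed, wherein the text sample to be processed is marked with a first transliteration name text;

extracting the first transliteration name text;

and generating the transliteration name recognition rule according to the first transliteration name text, wherein the transliteration name recognition rule is used for recognizing a second transliteration name text in the text to be processed.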

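2. The method for generating a transliteration name recognition rule according to claim 1, wherein generating the transliteration name recognition rule based on the transliteration name text comprises:

constructing a transliteration name database according to the first transliteration name text;

determining the frequency corresponding to each word in the first transliteration name text in the transliteration name database;

and generating a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text.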

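3. The method for generating a transliteration name recognition rule according to claim 2, wherein constructing a transliteration name recognition rule according to the frequency corresponding to each word in the first transliteration name text comprises:

obtaining a frequency threshold according to the frequency corresponding to each word in the first transliteration name text;

the transliteration name recognition rule is obtained according to the following formula:

$$\min(p_i) > P_{\text{threshold}}, \quad i = 1, 2, \ldots, n$$

where $\min(p_i)$ denotes the minimum frequency of the words in a phrase of the text to be processed and $P_{\text{threshold}}$ denotes the frequency threshold; when the minimum frequency of the words in a phrase of the text to be processed is greater than the frequency threshold, the phrase is determined to be a second transliteration name text.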

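4. The method for generating a transliteration name recognition rule according to claim 1, wherein before extracting the first transliteration name text, the method further comprises:

dividing the text sample to be processed into a first text sample and a second text sample, wherein the first text sample contains a name of a person, and the second text sample does not contain the name of the person;

inputting the first text sample into a preset classification model for training to obtain a name recognition model;

wherein the name recognition model serves as a pre-screening rule corresponding to the transliteration name recognition rule: before the transliteration name recognition rule recognizes a second transliteration name text in the text to be processed, the model is used to screen the text to be processed for person names.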

5. The transliteration name recognition rule generating method of claim 1, wherein after generating the transliteration name recognition rule based on the first transliteration name text, further comprising:

extracting left adjacent characters and right adjacent characters of the first transliteration name from a text sample to be processed;

forming a prefix corpus according to the left adjacent characters;

forming a suffix corpus according to the right adjacent characters;

and generating a check rule corresponding to the transliteration name identification rule based on the prefix corpus and the suffix corpus, wherein the check rule is used for checking the second transliteration name text after the transliteration name identification mechanism identifies the second transliteration name text in the text to be processed.

6. A method for generating a transliterated name recognition rule is characterized by comprising the following steps:

7. A generating device, comprising a memory, a processor, an image pick-up means and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 5 when executing the computer program.

8. A storage medium storing a computer program, characterized in that the computer program, when executed by a processor, carries out the steps of the method according to any of claims 1 to 5.