CN113033208A - Government affair text data part-of-speech tagging-based enterprise owner matching method - Google Patents

Info

Publication number
CN113033208A
Authority
CN
China
Prior art keywords
enterprise
matching
word
name
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110431789.6A
Other languages
Chinese (zh)
Inventor
张聪
吴地龙
吴天飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Non Line Digital Technology Co ltd
Original Assignee
Zhejiang Non Line Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Non Line Digital Technology Co ltd filed Critical Zhejiang Non Line Digital Technology Co ltd
Priority to CN202110431789.6A priority Critical patent/CN113033208A/en
Publication of CN113033208A publication Critical patent/CN113033208A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an enterprise owner matching method based on part-of-speech tagging of government affairs text data. The method extracts enterprise names from government affairs text data, extracts enterprise naming patterns according to known enterprise naming rules, matches those naming patterns, and determines the matching result of the enterprise subject in the text from the result of the naming-pattern match.

Description

Government affair text data part-of-speech tagging-based enterprise owner matching method
Technical Field
The invention relates to the technical field of computer applications, and in particular to an enterprise owner matching method based on part-of-speech tagging of government affairs text data.
Background
With the continued advance of national informatization, data resource sharing and integration is being carried out in many regions. Within government departments, however, multiple systems still work side by side and share data through complex interactions. Under this arrangement, when one system updates its data late or is shut down, the data held by the other systems easily goes stale. Enterprise information is the core content of these systems, but it has many attributes that change over time. An enterprise name, for example, may be changed several times, and manual entry errors may occur during each change; a wrongly identified enterprise name then easily causes problems in the other functions built around that enterprise. Most of the prior art matches enterprise names by manual checking, so checking whether the enterprise subjects in two texts match is time-consuming and laborious work in government affairs big data processing. Solving this problem efficiently and freeing up human resources is one of the problems that government affairs big data faces.
Disclosure of Invention
The technical problem to be solved by the invention is that, in existing government affairs big data processing, the matching of enterprise subjects between two texts is checked mainly by hand, which is time-consuming and labor-intensive. To address this, the invention extracts enterprise names from government affairs text data, extracts enterprise naming patterns according to known enterprise naming rules, matches the naming patterns, and determines the matching result of the enterprise subject in the text from the result of the naming-pattern match.
In order to solve the above problems, the technical solutions provided by the embodiments of the present invention are as follows:
In a first aspect of the embodiments of the present invention, an enterprise owner matching method based on part-of-speech tagging of government affairs text data is provided, the method including the following steps:
acquiring a text to be identified;
inputting the text to be recognized into an enterprise entity recognition module, and taking the output of the module as the enterprise name main vocabulary corresponding to the text to be recognized;
inputting the enterprise name main vocabulary into a pattern extraction module, and taking the output of the module as the enterprise name words to be matched corresponding to the text to be recognized; the enterprise name words to be matched consist of three parts: city information representing the geographic location; the name information of the enterprise together with the industry information of the enterprise; and the enterprise property information; the word carrying the city information is designated the first word to be matched, the word carrying the name and industry information is designated the second word to be matched, and the word carrying the enterprise property information is designated the third word to be matched;
constructing an own enterprise name library: the own enterprise name text is input into the pattern extraction module to obtain enterprise name matching words, which consist of three parts: city information representing the geographic location; the name information of the enterprise together with the industry information of the enterprise; and the enterprise property information; the word carrying the city information is designated the first matching word, the word carrying the name and industry information is designated the second matching word, and the word carrying the enterprise property information is designated the third matching word;
for any alternative enterprise name in the pre-constructed own enterprise name library, matching the enterprise name words to be matched against the enterprise name matching words of that alternative enterprise name;
comparing the first word to be matched with the first matching word; if the first word to be matched is missing, or the first word to be matched and the first matching word match, calculating in turn the matching score of the second word to be matched against the second matching word and the matching score of the third word to be matched against the third matching word, and finally obtaining the comprehensive score of the pattern match; if the first word to be matched does not match the first matching word, the alternative enterprise name fails to match; and selecting as output the matching item whose comprehensive score is greater than the threshold and is the highest, and determining the alternative enterprise name with the highest comprehensive score to be the standard enterprise name.
In one possible implementation, the enterprise entity recognition module performs recognition as follows: for the text to be recognized, words having the rightmost-boundary property of an enterprise entity name are first used to search, in turn, for all rightmost boundary words of enterprise entity names; this search is designated the first search. A geographic information lexicon is then used to find the leftmost boundary corresponding to each rightmost boundary; this search is designated the second search. The characters between the leftmost boundary and the rightmost boundary are determined to be the enterprise name main vocabulary.
In one possible implementation, the words having the rightmost-boundary property of an enterprise entity name are 'company', 'limited company' and 'studio'.
In one possible implementation, both the first search and the second search are performed with a deterministic finite automaton (DFA).
In one possible implementation, the words having the rightmost-boundary property of an enterprise entity name are built into a tree data model in reverse order, and during searching the text to be recognized is searched in reverse for the rightmost boundary words of all enterprise entity names.
In one possible implementation, if the first search succeeds and the second search fails, the characters from the first character of the text to be recognized up to the rightmost boundary are output as the enterprise name main vocabulary.
In one possible implementation, if both the first search and the second search fail, the text to be recognized is output.
In one possible implementation, the text output by the enterprise entity recognition module is part-of-speech tagged with the part-of-speech tagging function of the jieba word segmentation tool. According to the part-of-speech annotations of the jieba tagging tool, the word whose part of speech denotes city information of a geographic location is selected as the first word to be matched, and the two words whose parts of speech cover the name and industry information of the enterprise and the enterprise property information are selected, in order, as the second word to be matched and the third word to be matched.
In one possible implementation, the matching score of the second word to be matched and the second matching word is calculated with the longest common subsequence algorithm, and the matching score of the third word to be matched and the third matching word is likewise calculated with the longest common subsequence algorithm.
The method of the invention is realized through the following steps:
(1) Custom enterprise entity naming: an enterprise entity recognition module extracts the enterprise name main vocabulary. A standard enterprise name consists of three parts: city information representing the geographic location (ns); the name information of the enterprise together with the industry information of the enterprise (n); and the enterprise property information (n). The standard enterprise naming rule can therefore be understood as conforming to the pattern ns + nn. Based on this pattern, a custom method determines the leftmost and rightmost boundaries of the enterprise entity to extract the enterprise name main vocabulary;
(2) Acquiring the enterprise name words to be matched from the government affairs text with an ns + nn pattern extraction module: for the enterprise name main vocabulary obtained by the enterprise entity recognition module in step (1), the matching pattern (ns + nn) is extracted according to the characteristics of the own lexicons used during recognition; for text from which the enterprise name main vocabulary cannot be extracted directly, part-of-speech tagging is performed directly with the jieba word segmentation tool, and one or more groups of ns + nn matching patterns are obtained from the text, each starting with a word that has geographic characteristics;
(3) Enterprise subject matching: an enterprise subject matching module matches the enterprise subjects. The city information (ns) part of each ns + nn matching pattern obtained in step (2) is compared exactly. If the ns part is missing or matches, the matching calculation is performed directly on the part of the nn sequence containing the name information of the enterprise and the industry information of the enterprise, to obtain the comprehensive score of the pattern match; the candidate with the highest score is regarded as the successfully matched enterprise, and the process ends. If the ns part fails to match, the enterprise match fails and the process ends.
Wherein, the specific steps in the step (1) are as follows:
step 1): identifying the enterprise entity in the text with a custom named entity recognition method: first, according to a predefined lexicon for the enterprise name field, containing words with the rightmost-boundary property of enterprise names such as 'company' and 'limited company', all rightmost boundary words of enterprise names are searched for in reverse order; then, according to a predefined geographic information lexicon, the leftmost boundary corresponding to each rightmost boundary is searched for. If the extraction succeeds, the extracted enterprise name main vocabulary is output and marked True. Otherwise the text is marked False, and either the original text or the text screened by the first successful match is output.
Wherein, the specific steps in the step (2) are as follows:
step 2): the output of step 1) is taken as the input of this module and divided into 3 parts. If step 1) is marked True, the word matched by the second search is used as the ns sequence of the pattern, the word matched by the first search is used as the second n sequence of the pattern, and the characters between these two words are used as the first n sequence of the pattern. If step 1) is marked False, the output of step 1) is part-of-speech tagged with the part-of-speech tagging function of the jieba word segmentation tool; according to the part-of-speech annotations of the jieba tagging tool, the word whose part of speech is "ns" is selected as the ns part of the pattern, and two words whose part of speech is "n" are selected in order as the nn sequence of the pattern.
Wherein, the specific steps in the step (3) are as follows:
step 3): based on the ns + nn pattern sequence obtained in step (2), the ns part is matched first, since ns carries definite location information. If the ns parts do not correspond, the enterprise entities described in the two texts are different enterprises; the program ends and returns False. If the match succeeds or the ns part of the pattern is missing, proceed to step 4) for the matching calculation;
step 4): for the nn sequence part containing the name information of the enterprise itself, the matching score is calculated with the longest common subsequence (LCS) algorithm;
step 5): for the nn sequence part containing the enterprise property information, the matching score is likewise calculated with the LCS algorithm;
the matching scores obtained in step 4) and step 5) are weighted to obtain the final comprehensive score of the nn sequence. The comprehensive score is compared with a predefined threshold: if all comprehensive scores are below the threshold, the enterprise name match fails, the program ends and returns False; if a score is above the threshold and is the highest, the enterprise name match succeeds, the program ends and returns the successfully matched enterprise name.
In a second aspect of the embodiments of the present invention, an enterprise agent matching apparatus is provided, including: a first acquisition unit, used for acquiring a text to be recognized, wherein the text to be recognized includes at least an enterprise name;
the second acquisition unit is used for acquiring an enterprise name main vocabulary corresponding to the text to be recognized from the text to be recognized;
the third obtaining unit is used for obtaining the enterprise name to-be-matched words corresponding to the text to be recognized from the text to be recognized;
the enterprise name library unit is used for converting the own enterprise name text into an enterprise name matching word;
the matching unit is used for matching the enterprise name to-be-matched word with the enterprise name matching word of the alternative enterprise name for any alternative enterprise name in the enterprise name library unit and calculating the comprehensive score of the alternative enterprise name;
and the determining unit selects the matching item with the comprehensive score larger than the threshold value and the highest comprehensive score as output, and determines the alternative enterprise name with the highest comprehensive score as the standard enterprise name.
Compared with the prior art, the invention has the following advantages. It solves the problem that manually checking whether the enterprise subjects of two texts match is time-consuming, laborious and inefficient in government affairs big data processing, and it is designed to extract a custom pattern (ns + nn) conforming to the enterprise naming style either directly from the text or from an extracted enterprise entity. When an enterprise registers its name with the industry and commerce bureau, the name must consist of four parts: the name of the provincial, city or county administrative division, the trade name, the industry or business characteristics, and the organizational form. A custom enterprise entity extraction method is proposed specifically on the basis of these characteristics: a lexicon of organizational forms is constructed in advance, then a lexicon of the provincial, city and county administrative divisions where enterprises are located is constructed in advance, and the rightmost and leftmost boundaries of an enterprise name are determined with these two lexicons so as to extract the enterprise name main vocabulary. If an enterprise name main vocabulary meeting the requirements cannot be extracted directly from the text, the text is tagged and grouped based on the semantic analysis of jieba word segmentation to obtain one. In addition, custom pattern segments with different meanings are matched in different ways: the custom ns segment, which carries geographic properties, is matched directly, while the nn segments are scored with the longest common subsequence (LCS) algorithm.
Drawings
FIG. 1 is a flow chart of the ns + nn mode extraction method of the present invention;
FIG. 2 is a flow chart of a longest common subsequence matching method in accordance with the present invention;
FIG. 3 is a schematic diagram of a pattern extraction process in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a pattern matching process in an embodiment of the present invention;
FIG. 5 is a schematic diagram of an enterprise agent matching device in an embodiment of the invention.
Detailed Description
The present invention will be further described with reference to FIGS. 1-4 and the following detailed description.
An enterprise owner matching method based on part-of-speech tagging of government affairs text data comprises the following steps:
(1) Custom enterprise entity naming: an enterprise entity recognition module extracts the enterprise name main vocabulary. Based on the own enterprise-property lexicon and geographic lexicon, the leftmost and rightmost boundaries of the enterprise name are determined, and the enterprise name main vocabulary in the text is thereby obtained.
(2) Acquiring the enterprise name words to be matched from the government affairs text with an ns + nn pattern extraction module: for the enterprise name main vocabulary obtained by the enterprise entity recognition module in step (1), the matching pattern (ns + nn) is extracted according to the characteristics of the own lexicons used during recognition; for text from which the enterprise name main vocabulary cannot be extracted directly, part-of-speech tagging is performed directly with the jieba word segmentation tool, and one or more groups of ns + nn matching patterns are obtained from the text, each starting with a word that has geographic characteristics; qualifiers such as 'province' and 'city' are added when the ns part is extracted. Pattern extraction is shown in FIG. 1;
(3) Enterprise subject matching: an enterprise subject matching module matches the enterprise subjects. Each piece of data to be matched in the database is processed into three parts for storage in the manner of step (1) and step (2), and specific words such as 'province' and 'city' are omitted when the ns part is stored. The ns + nn matching patterns obtained from the text are compared on the city information (ns) part. If the ns part is missing or matches, the matching calculation is performed directly on the parts of the nn sequence containing the name information of the enterprise and the enterprise property information, giving the comprehensive score of the pattern match; the item whose comprehensive score is greater than the threshold and is the highest is the successful match, and the process ends. If the ns part fails to match, the enterprise match fails and the process ends, as shown in FIG. 2.
Wherein, the specific steps in the step (1) are as follows:
step 1): first, the input text is searched in reverse for the rightmost boundary words of all enterprise names, according to the predefined enterprise name field lexicon. To improve search accuracy, the words in the lexicon are built into a tree data model in reverse order, which is why reverse search is used;
The enterprise name field lexicon is shown in Table 1:
Word
limited liability company
joint-stock limited company
limited company
company
studio
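The reverse-order tree model described above might look like the following sketch. It is only an illustration under stated assumptions: the helper names `build_reverse_trie` and `find_right_boundaries` are hypothetical, and the Chinese strings are assumed renderings of the translated lexicon entries ('company', 'limited company', 'studio').

```python
from typing import Dict, List, Tuple

END = "is_end"  # end-of-word flag, mirroring the 'is_end' key used in the text


def build_reverse_trie(words: List[str]) -> Dict:
    """Insert every word back to front so the text can be walked right to left."""
    root: Dict = {END: False}
    for word in words:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {END: False})
        node[END] = True
    return root


def find_right_boundaries(text: str, trie: Dict) -> List[Tuple[int, str]]:
    """Return (end_index_exclusive, word) for the longest boundary word ending at each position."""
    hits = []
    for end in range(len(text), 0, -1):       # scan end positions right to left
        node, best_start, i = trie, None, end - 1
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if node[END]:
                best_start = i                # keep walking: the longest word wins
            i -= 1
        if best_start is not None:
            hits.append((end, text[best_start:end]))
    return hits


# Assumed Chinese forms of the lexicon entries (illustrative only).
trie = build_reverse_trie(["公司", "有限公司", "工作室"])
hits = find_right_boundaries("绍兴外事旅游汽车服务有限公司的车辆", trie)
```

Because the walk keeps going after the first complete word, 'limited company' is preferred over the shorter 'company' when both end at the same character.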
Then the left boundary of the enterprise name is searched for according to a predefined geographic information lexicon, this time with a forward search. The lexicon data is shown in Table 2:
Word
Zhejiang
Zhejiang Hangzhou
Zhejiang Huzhou
Zhejiang Wenzhou
If both searches succeed, the extracted enterprise name is output and marked True, indicating that the enterprise entity was successfully extracted; otherwise it is marked False, indicating that the enterprise entity was not successfully extracted. If the first search succeeds and the second fails, the text from the start of the original text to the rightmost boundary is output; if both searches fail, the original text is output. This step may therefore output zero, one, or more groups of text.
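The two searches and their fallbacks can be sketched as below. This is a simplified illustration rather than the patent's implementation: plain string operations stand in for the tree-based search, the function name `extract_names` is hypothetical, the Chinese lexicon entries are assumed renderings, and taking the earliest geographic word in a segment as the left boundary is an assumption.

```python
from typing import List, Tuple


def extract_names(text: str, right_words: List[str], geo_words: List[str]) -> List[Tuple[bool, str]]:
    """Return (flag, output) pairs: True with the extracted name, or False with the fallback text."""
    results: List[Tuple[bool, str]] = []
    # Try longer boundary words first so "有限公司" wins over "公司".
    right_sorted = sorted(right_words, key=len, reverse=True)
    seg_start = pos = 0
    while pos < len(text):
        hit = next((w for w in right_sorted if text.startswith(w, pos)), None)
        if hit is None:
            pos += 1
            continue
        right_end = pos + len(hit)            # first search succeeded
        segment = text[seg_start:pos]
        # Second search: positions of geographic words inside the segment.
        starts = [segment.find(g) for g in geo_words if g in segment]
        if starts:
            results.append((True, text[seg_start + min(starts):right_end]))
        else:
            # First search succeeded, second failed: keep text up to the right boundary.
            results.append((False, text[seg_start:right_end]))
        seg_start = pos = right_end
    if not results:
        results.append((False, text))         # both searches failed: output the original text
    return results


right_words = ["公司", "有限公司", "工作室"]   # assumed boundary lexicon
geo_words = ["绍兴", "杭州"]                   # assumed geographic lexicon
out = extract_names("绍兴外事旅游汽车服务有限公司(浙XXX大型普通客车)", right_words, geo_words)
```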
Both searches are performed with a deterministic finite automaton (DFA).
The DFA matching procedure is as follows:
a. Construct a tree structure of words from the lexicon. Taking the lexicon of Table 2, and writing each character as its romanized syllable, the tree structure is: {'Zhe': {'is_end': False, 'jiang': {'is_end': True, 'Hang': {'is_end': False, 'zhou': {'is_end': True}}, 'Hu': {'is_end': False, 'zhou': {'is_end': True}}, 'Wen': {'is_end': False, 'zhou': {'is_end': True}}}}}.
b. Set the matching mode to maximum matching. Assuming the original text is 'Zhejiang Hangzhou is a place suitable for enterprise development', the word obtained under maximum matching is 'Zhejiang Hangzhou' rather than 'Zhejiang'.
c. Output the matched word and its position in the original text.
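Steps a to c can be sketched as a character tree with longest-match lookup. The function names `build_trie` and `max_match` are hypothetical, and the Chinese strings are assumed originals of the Table 2 entries; the nested dictionaries with an `is_end` flag follow the structure described in step a.

```python
from typing import Dict, List, Tuple


def build_trie(words: List[str]) -> Dict:
    """Step a: nested dicts keyed by character, with 'is_end' marking a complete word."""
    root: Dict = {"is_end": False}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {"is_end": False})
        node["is_end"] = True
    return root


def max_match(text: str, trie: Dict) -> List[Tuple[str, int]]:
    """Steps b and c: maximum matching, returning (word, start position) pairs."""
    hits: List[Tuple[str, int]] = []
    i = 0
    while i < len(text):
        node, j, best_end = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if node["is_end"]:
                best_end = j                  # remember the longest complete word so far
        if best_end is None:
            i += 1
        else:
            hits.append((text[i:best_end], i))
            i = best_end
    return hits


# Assumed Chinese forms of Table 2: Zhejiang, Zhejiang Hangzhou, Zhejiang Huzhou, Zhejiang Wenzhou.
geo_trie = build_trie(["浙江", "浙江杭州", "浙江湖州", "浙江温州"])
matches = max_match("浙江杭州是一个适合企业发展的地方", geo_trie)
```

With maximum matching the lookup does not stop at the first complete word, so the longer province-plus-city entry is returned instead of the province alone.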
Wherein, the specific steps in the step (2) are as follows:
step 2): the output of step 1) is taken as the input of this module and divided into 3 parts. If step 1) is marked True, then in the custom named entity recognition method the word matched by the first search has enterprise-property characteristics, the word matched by the second search has geographic characteristics, and the text between the two matched words can be regarded as carrying the enterprise's own information; the output of the previous step is extracted according to the ns + nn pattern. If step 1) is marked False, the intercepted character string is tagged with the part-of-speech tagging function of the jieba word segmentation tool. According to the part-of-speech annotations of the jieba tagging tool, the word whose part of speech is "ns" is selected as the ns part of the pattern, and two words whose part-of-speech tag contains the character "n" are selected in order as the nn parts of the pattern. To ensure that as many patterns as possible are extracted, this step extracts all possible pattern groups in the text.
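The tag-based branch of this step might be sketched as follows. The function works on already tagged (word, tag) pairs so that it runs without jieba installed; in practice the pairs could come from `jieba.posseg.cut`, whose results expose `.word` and `.flag`. The function name `extract_patterns`, the hand-written tagged input, and the rule of taking the first two qualifying words after an ns word are assumptions made for illustration.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]            # (word, part-of-speech tag)
Pattern = Tuple[str, str, str]      # (ns, first n, second n)


def extract_patterns(tagged: List[Tagged]) -> List[Pattern]:
    """Collect every ns + nn group; ns is "" when no geographic word precedes the nouns."""
    patterns: List[Pattern] = []
    i = 0
    while i < len(tagged):
        word, flag = tagged[i]
        if flag == "ns":
            ns = word
            i += 1
        elif "n" in flag:
            ns = ""                 # pattern with a missing ns part
        else:
            i += 1
            continue
        nouns: List[str] = []
        while i < len(tagged) and len(nouns) < 2:
            w, f = tagged[i]
            if f == "ns" or "n" not in f:
                break
            nouns.append(w)
            i += 1
        if len(nouns) == 2:
            patterns.append((ns, nouns[0], nouns[1]))
    return patterns


# Hand-written tags for illustration; with jieba installed one could build them as
#   import jieba.posseg as pseg
#   tagged = [(p.word, p.flag) for p in pseg.cut(text)]
tagged = [("绍兴", "ns"), ("外事旅游汽车服务", "n"), ("有限公司", "n"), ("(", "x")]
patterns = extract_patterns(tagged)
```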
Wherein, the specific steps in the step (3) are as follows:
The result of preprocessing the own enterprise name text is shown in Table 3:
Enterprise name | ns (geographic information) | nn (own information) | nn (property)
Hangzhou Ling Warm-Ventilation Engineering Co Ltd | Hangzhou | Ling heating and ventilation engineering | limited company
Yuyao City Mengde Sanitary Ware Co Ltd | Yuyao | Mengde sanitary ware | limited company
Ningbo City Jinyang Electric Appliance Co Ltd | Ningbo | Jinyang electric appliance | limited company
Hangzhou Yixin Investment Consulting Co Ltd | Hangzhou | Yixin investment consulting | limited company
Step 3): all pattern groups from step 2) are processed in turn, and the following operations are carried out on each group. The ns part of the pattern is obtained first and compared exactly with the ns parts of all preprocessed own enterprise names. If the ns comparison succeeds, go to step 4); if the extracted ns part is empty, also go to step 4); if the comparison fails, the current round of comparison ends.
Step 4): for the nn sequence part containing the name information of the enterprise itself, the matching score is calculated with the longest common subsequence (LCS) algorithm;
The LCS matching score is calculated as follows:
Let string A = x1x2…xm and string B = y1y2…yn, let string Z be the longest common subsequence of A and B, and let |Z| denote the length of Z. The score is then
$$\mathrm{score}(n_1) = \frac{|Z|}{|A| + |B| - |Z|} \times w_1$$
where w1 is the weight of step 4). For example, |A| = 8, |B| = 6 and |Z| = 4 give 4 / (8 + 6 - 4) = 0.4 before weighting.
Step 5): for the nn sequence part containing the enterprise property information, the matching score is likewise calculated with LCS; the matching score score(n2) is calculated in the same way as above, with weight w2.
The matching scores obtained in step 4) and step 5) are weighted to obtain the final comprehensive score of the nn sequence. Considering the characteristics of the text matched in steps 4) and 5), different weights are set for the two steps: because the enterprise's own information matters more than the enterprise property, w1 is set to 0.65 and w2 to 0.35. The final comprehensive score is score(n1) + score(n2).
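The weighted LCS scoring of steps 4) and 5) can be sketched as below. The weights 0.65 and 0.35 come from the text; the normalization |Z| / (|A| + |B| - |Z|) is an assumption, chosen because the formula figure is not reproduced in the text and this ratio yields the 0.4 used in the worked example for |A| = 8, |B| = 6, |Z| = 4. The function names are hypothetical.

```python
W1, W2 = 0.65, 0.35   # weights stated in the text for own information and property


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """ASSUMED normalization: |Z| / (|A| + |B| - |Z|); returns 0.0 for two empty strings."""
    z = lcs_length(a, b)
    denom = len(a) + len(b) - z
    return z / denom if denom else 0.0


def composite_score(q_n1: str, q_n2: str, c_n1: str, c_n2: str) -> float:
    """score(n1) + score(n2), each LCS ratio multiplied by its weight."""
    return W1 * lcs_ratio(q_n1, c_n1) + W2 * lcs_ratio(q_n2, c_n2)


# Placeholder strings with |A| = 8, |B| = 6, |Z| = 4, mirroring the worked example.
total = composite_score("abcdefgh", "有限公司", "abcdxy", "有限公司")
```

Under these assumptions the placeholder inputs reproduce the example's 0.26 + 0.35 split.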
Step 6): Sort the composite scores obtained by matching the pattern against all the data in Table 3, and select the enterprise name whose score is the highest and greater than the threshold 0.5 as the final output; otherwise, end.
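The selection in step 6) can be sketched as below. The function name select_best and the dictionary layout (enterprise name mapped to composite score) are assumptions made for illustration. Returning None stands in for the "otherwise, end" branch.

```python
from typing import Dict, Optional

THRESHOLD = 0.5  # threshold stated in step 6)


def select_best(scores: Dict[str, float],
                threshold: float = THRESHOLD) -> Optional[str]:
    """Return the highest-scoring name if its score exceeds the threshold."""
    if not scores:
        return None
    best_name = max(scores, key=scores.get)
    # The comparison is strict: a score equal to the threshold is rejected.
    return best_name if scores[best_name] > threshold else None


if __name__ == "__main__":
    candidates = {"candidate A": 0.61, "candidate B": 0.30}
    print(select_best(candidates))
```

Taking the maximum gives the same result as a full sort when only the top candidate is needed.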
The invention is used in practice as follows. As shown in figs. 3-4, when the input text is "Shaoxing county travel vehicle service Co., Ltd. (Zhe XXX large common passenger car); XXX (Zhe XXX general two-wheeled motorcycle)", the output result is as shown in fig. 3. As fig. 3 shows, the enterprise entity recognition module matches the rightmost boundary 'limited company' of the input text from right to left according to the word bank; likewise, according to the geographic word bank, it matches the leftmost boundary 'Shaoxing' in the input text to the left of 'limited company', yielding the enterprise name 'Shaoxing county tourism automobile service limited company'. In the pattern extraction module, the patterns 'NS: Shaoxing', 'N: foreign travel automobile service' and 'N: Limited company' are obtained according to step 2). In the enterprise subject matching module, the pattern is matched in turn against the data in Table 3 to calculate the scores.
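The boundary search performed by the enterprise entity recognition module can be sketched as follows. This is a minimal illustration, not the patented implementation. It uses plain substring search instead of the deterministic finite automaton named in the claims. The function name extract_enterprise_name, both word lists, and the English sample text are hypothetical. Taking the earliest geographic word as the left boundary is an assumption.

```python
from typing import Sequence

# Hypothetical word banks, for illustration only.
SUFFIX_WORDS = ("Co., Ltd.", "Company")
GEO_WORDS = ("Shaoxing", "Ningbo", "Hangzhou")


def extract_enterprise_name(text: str,
                            suffix_words: Sequence[str] = SUFFIX_WORDS,
                            geo_words: Sequence[str] = GEO_WORDS) -> str:
    """Cut the enterprise name out of text using right and left boundary words."""
    # First search: end position of the earliest-ending right-boundary word.
    ends = [text.find(w) + len(w) for w in suffix_words if w in text]
    if not ends:
        # No right boundary found: return the input text unchanged.
        return text
    right = min(ends)
    # Second search: earliest geographic word lying before the right boundary.
    starts = [text.find(g, 0, right) for g in geo_words]
    starts = [s for s in starts if s != -1]
    # If no geographic word is found, keep everything from the first character.
    left = min(starts) if starts else 0
    return text[left:right]


if __name__ == "__main__":
    print(extract_enterprise_name("Notice: Shaoxing Travel Service Co., Ltd. (large bus)"))
```

The two fallbacks mirror the cases in the claims: when only the right boundary is found, the text from the first character to that boundary is returned, and when no right boundary is found, the input is returned as is.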
Assume that one record matched against the pattern is 'NS: Shaoxing', 'N: foreign travel service' and 'N: Limited company'. The NS comparison of step 3) is performed and, since it succeeds, step 4) follows. As an example, in step 4), if |A| = 8, |B| = 6 and |Z| = 4, then score(n1) = 0.4 × w1 = 0.26. Similarly, in step 5), score(n2) = 1 × w2 = 0.35, and the final score is 0.61. In step 6), all the scores are sorted and the enterprise name with the highest score, provided it is greater than the threshold 0.5, is selected as output; the enterprise name finally returned is 'Shaoxing foreign travel service company'.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system or the device disclosed by the embodiment, the description is simple because the system or the device corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
It is to be understood that, in the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may each be single or plural.
It is further noted that, in the present invention, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In summary, the preferred embodiments of the present invention are shown and described, and some modifications of the embodiments that may occur to those skilled in the art will embody the principles of the present invention and shall fall within the technical scope of the present invention.

Claims (10)

1. A method for matching business owners based on part-of-speech tagging of government affair text data is characterized by comprising the following steps:
acquiring a text to be identified;
inputting the text to be recognized into an enterprise entity recognition module, and taking the output of the enterprise entity recognition module as an enterprise name main vocabulary corresponding to the text to be recognized;
inputting the enterprise name main vocabulary into a pattern extraction module, and taking the output of the pattern extraction module as the enterprise name words to be matched corresponding to the text to be recognized; the enterprise name words to be matched consist of the following three parts: city information representing the geographic location; the enterprise's own name information and the industry information of the enterprise; and enterprise property information; wherein the word representing the city information of the geographic location is designated the first word to be matched, the word representing the enterprise's own name information and the industry information of the enterprise is designated the second word to be matched, and the word representing the enterprise property information is designated the third word to be matched;
constructing an own enterprise name library: inputting the own enterprise name text into the pattern extraction module to obtain enterprise name matching words, wherein the enterprise name matching words consist of the following three parts: city information representing the geographic location; the enterprise's own name information and the industry information of the enterprise; and enterprise property information; wherein the word representing the city information of the geographic location is designated the first matching word, the word representing the enterprise's own name information and the industry information of the enterprise is designated the second matching word, and the word representing the enterprise property information is designated the third matching word;
for any alternative enterprise name in the pre-constructed own enterprise name library, matching the enterprise name words to be matched with the enterprise name matching words of that alternative enterprise name;
comparing the first word to be matched with the first matching word; if the first word to be matched is missing, or the first word to be matched and the first matching word match successfully, calculating in turn the matching score of the second word to be matched against the second matching word and the matching score of the third word to be matched against the third matching word, and finally obtaining the composite score of the pattern match; if the first word to be matched and the first matching word do not match successfully, the alternative enterprise name fails to match; and selecting as output the matching item whose composite score is the highest and greater than the threshold, and determining the alternative enterprise name with the highest composite score as the standard enterprise name.
2. The method for matching business entities based on part-of-speech tagging of government affairs text data according to claim 1, wherein the business entity identifying module specifically identifies the business entities by:
for the text to be recognized, firstly, adopting a word with the property of the rightmost boundary of the enterprise entity name, and sequentially searching all rightmost boundary words of the enterprise entity name, wherein the searching process is appointed to be a first searching;
then, searching out the leftmost boundaries corresponding to all the rightmost boundaries by adopting a geographic information word bank, wherein the searching process is appointed to be a second searching;
and determining the characters between the leftmost boundary and the rightmost boundary as the main vocabulary of the enterprise name.
3. The method for matching business entities based on part-of-speech tagging of government affairs text data according to claim 2, wherein the word having the rightmost-boundary property of an enterprise entity name is one of: company, corporation, studio.
4. The method for matching business owners based on part-of-speech tagging of government affairs text data according to claim 2, wherein the first search and the second search are both performed by using a deterministic finite automata algorithm.
5. The method for matching business entities based on part-of-speech tagging of government affairs text data according to claim 3, wherein a tree data model is constructed from the words having the rightmost-boundary property of enterprise entity names in reverse order, and during searching the text to be recognized is searched in reverse for the rightmost boundary words of all enterprise entity names present in it.
6. A business entity matching method based on part-of-speech tagging of government affairs text data according to claim 2, wherein if the first search is successful and the second search is failed, characters from a start character to a character at a rightmost boundary of the text to be recognized are output as a business name subject vocabulary.
7. The method of matching business owners based on part-of-speech tagging of government affairs text data according to claim 2, wherein if the first search fails and the second search also fails, the text to be recognized is output.
8. The business entity matching method based on part-of-speech tagging of government affairs text data according to claim 6 or 7, wherein the part-of-speech tagging function of the jieba word segmenter is applied to the text output by the enterprise entity recognition module to tag the intercepted character string; according to the part-of-speech labels of the jieba tagging tool, the word whose part of speech denotes city information of a geographic location is selected as the first word to be matched, and the words of the two parts of speech covering the enterprise's own name information, the industry information of the enterprise, and the enterprise property information are selected in turn as the second word to be matched and the third word to be matched.
9. The business entity matching method based on the part-of-speech tagging of government affairs text data according to claim 1, wherein the matching score of the second word to be matched and the second matching word is calculated by using a longest common subsequence algorithm; and calculating the matching scores of the third word to be matched and the third matching word by using a longest common subsequence algorithm.
10. The enterprise subject matching device is characterized by comprising a first acquisition unit, a second acquisition unit and a matching unit, wherein the first acquisition unit is used for acquiring a text to be recognized, and the text to be recognized at least comprises an enterprise name;
the second acquisition unit is used for acquiring an enterprise name main vocabulary corresponding to the text to be recognized from the text to be recognized;
the third obtaining unit is used for obtaining the enterprise name to-be-matched words corresponding to the text to be recognized from the text to be recognized;
the enterprise name library unit is used for converting the own enterprise name text into an enterprise name matching word;
the matching unit is used for matching the enterprise name to-be-matched word with the enterprise name matching word of the alternative enterprise name for any alternative enterprise name in the enterprise name library unit and calculating the comprehensive score of the alternative enterprise name;
and the determining unit selects the matching item with the comprehensive score larger than the threshold value and the highest comprehensive score as output, and determines the alternative enterprise name with the highest comprehensive score as the standard enterprise name.
CN202110431789.6A 2021-04-21 2021-04-21 Government affair text data part-of-speech tagging-based enterprise owner matching method Pending CN113033208A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110431789.6A CN113033208A (en) 2021-04-21 2021-04-21 Government affair text data part-of-speech tagging-based enterprise owner matching method

Publications (1)

Publication Number Publication Date
CN113033208A true CN113033208A (en) 2021-06-25

Family

ID=76457203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110431789.6A Pending CN113033208A (en) 2021-04-21 2021-04-21 Government affair text data part-of-speech tagging-based enterprise owner matching method

Country Status (1)

Country Link
CN (1) CN113033208A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547404A (en) * 2022-01-10 2022-05-27 普瑞纯证医疗科技(苏州)有限公司 Big data platform

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045847A (en) * 2015-07-01 2015-11-11 广州市万隆证券咨询顾问有限公司 Method for extracting Chinese institutional unit name from text information
CN110413764A (en) * 2019-06-18 2019-11-05 杭州熊猫智云企业服务有限公司 Long text enterprise name recognizer based on built in advance dictionary
CN111008265A (en) * 2019-12-03 2020-04-14 腾讯云计算(北京)有限责任公司 Enterprise information searching method and device
CN111783467A (en) * 2020-07-21 2020-10-16 致诚阿福技术发展(北京)有限公司 Enterprise name identification method and device
CN111783460A (en) * 2020-06-15 2020-10-16 苏宁金融科技(南京)有限公司 Enterprise abbreviation extraction method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
CN109543178B (en) Method and system for constructing judicial text label system
CN104636466B (en) Entity attribute extraction method and system for open webpage
CN109726298B (en) Knowledge graph construction method, system, terminal and medium suitable for scientific and technical literature
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN111723575A (en) Method, device, electronic equipment and medium for recognizing text
US9575947B2 (en) System and method of automatically mapping a given annotator to an aggregate of given annotators
CN112199512B (en) Scientific and technological service-oriented case map construction method, device, equipment and storage medium
CN112966079B (en) Event portrait oriented text analysis method for dialog system
CA2882280A1 (en) System and method for matching data using probabilistic modeling techniques
CN111723569A (en) Event extraction method and device and computer readable storage medium
CN112163424A (en) Data labeling method, device, equipment and medium
CN112256845A (en) Intention recognition method, device, electronic equipment and computer readable storage medium
CN111078832A (en) Auxiliary response method and system for intelligent customer service
CN112100324A (en) Knowledge graph automatic check iteration method based on greedy entity link
CN111078893A (en) Method for efficiently acquiring and identifying linguistic data for dialog meaning graph in large scale
CN115687563A (en) Interpretable intelligent judgment method and device, electronic equipment and storage medium
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
CN112183102A (en) Named entity identification method based on attention mechanism and graph attention network
CN115795056A (en) Method, server and storage medium for constructing knowledge graph by unstructured information
CN114625748A (en) SQL query statement generation method and device, electronic equipment and readable storage medium
CN113033204A (en) Information entity extraction method and device, electronic equipment and storage medium
CN113033208A (en) Government affair text data part-of-speech tagging-based enterprise owner matching method
CN113468890B (en) Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
CN113918720A (en) Training method, device and equipment of text classification model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210625)