CN115641009B

CN115641009B - Method and device for mining competitors based on a patent heterogeneous information network

Info

Publication number: CN115641009B
Application number: CN202211421320.5A
Authority: CN
Inventors: 陈洁; 张雪; 赵姝; 张燕平
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-05-05
Anticipated expiration: 2042-11-14
Also published as: CN115641009A

Abstract

The application provides a method and a device for mining competitors based on a patent heterogeneous information network, relating to the field of network representation learning. The method comprises the following steps: acquiring ground-truth competitor pairs, and extracting and cleaning patent data from a patent database according to the ground-truth competitor pairs to construct a patent data set; determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges; obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes through graph embedding; screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes; and calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the companies with the highest cosine similarity are taken as candidate competitors of the target company. The method constructs a network from patent data and introduces graph embedding methods and an attention mechanism for competitor mining, thereby improving mining efficiency.

Description

Method and device for mining competitors based on a patent heterogeneous information network

Technical Field

The application relates to the technical field of network representation learning, and in particular to a method and a device for mining competitors based on a patent heterogeneous information network.

Background

Patents are an important indicator of the level of scientific and technological development, and are the focus and core of scientific, technological, and economic competition. Patent competitor mining is a research field that has emerged in recent years. It aims to help industrial, business, legal, and decision-making groups discover important competitive relationships and identify leading business trends, thereby motivating novel industrial solutions and informing important investment decisions. In today's knowledge economy, competition over intellectual property is unavoidable. Mining competitors for target enterprises based on patent data is therefore of great significance.

Most current competitor mining methods are based on text data and ignore structural information. A small number of studies combine structural information with text information, but such methods do not adequately account for the heterogeneity of patent data and have difficulty efficiently mining latent structural features between companies. Although existing graph embedding methods have been used to mine competitive relationships among enterprises from patent data, no work has proposed a competitor mining scheme designed specifically for patent data. As a result, existing competitor mining techniques tend to perform worse than expected.

Disclosure of Invention

To solve the above problems, a method and a device for mining competitors based on a patent heterogeneous information network are provided. A network is constructed from patent data, the rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for use in competitor mining, thereby improving competitor mining performance.

The first aspect of the present application proposes a method for mining competitors based on a patent heterogeneous information network, comprising:

acquiring ground-truth competitor pairs, and extracting and cleaning patent data from a patent database according to the ground-truth competitor pairs to construct a patent data set;

determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges;

obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes by a graph embedding method;

screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes;

and calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the companies with the highest cosine similarity are taken as candidate competitors of the target company.

Optionally, extracting and cleaning patent data from the patent database to construct a patent data set includes:

obtaining the ground-truth competitor pairs using crawler technology, de-duplicating the patent data, and filtering out null values and invalid data.

Optionally, determining the patent semantic similarity according to the patent data set and constructing the patent semantic edges according to the patent semantic similarity includes:

for a document content set $T = \{tc_1, tc_2, \ldots, tc_{|D|}\}$, computing the weight of a word $w$ in document content $tc_i$ based on the term frequency-inverse document frequency (TF-IDF) index, according to the following formula:

$$\mathrm{TFIDF}(w, tc_i) = tf(w, tc_i) \times \log\frac{|D|}{m_w},$$

where $tc_i$ is the text content of patent $p_i$, $|D|$ is the total number of documents, $m_w$ is the number of documents in which the word $w$ appears at least once, and $tf(w, tc_i)$ is the number of times the word $w$ appears in document content $tc_i$;

obtaining the text feature vector of document content $tc_i$ according to the weights, where each word $w_i$ is represented by an N-dimensional word vector;

calculating the similarity of the text content of patents $tc_i$ and $tc_j$ in $T$, and calculating the patent semantic similarity ranking according to the following formula:

$$p_i^{p} = \mathrm{Top}_n(\mathrm{Sim}_1(p_i, p_j)),$$

where $tsim(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary indicator, $\lambda$ is a weighting coefficient for patents $tc_i$ and $tc_j$, and $\mathrm{Top}_n(\cdot)$ is used to obtain the list of the $n$ patents with the highest similarity;

constructing the patent semantic edges $PL = \{p_1^{p}, p_2^{p}, \ldots, p_{|D|}^{p}\}$ according to the similarity ranking.

Optionally, constructing a patent heterogeneous information network according to the patent semantic edges includes:

for a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where $V$, $E$, and $H$ denote the node set, edge set, and weight set respectively, $\zeta$ is the object type mapping function, $\psi$ is the relation type mapping function, and $\mu$ is the weight type mapping function, the weights of the relationships between the nodes are defined as follows:

if the relationship between two nodes is a patent semantic edge, its edge weight is $g - u \times rank_{index}$, where $u$ is the weight attenuation value and $rank_{index}$ is the index of the edge in the patent semantic similarity ranking.

Optionally, obtaining, by a graph embedding method, the company nodes in the patent heterogeneous information network and the structural features of the company nodes includes:

selecting a meta-path $P: A_1 \rightarrow A_2 \rightarrow \cdots \rightarrow A_{l+1}$ in a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where the meta-path walk follows a transition distribution in which $n_t$ is the $t$-th node traversed in the random walk, the node $v$ is of type $A_t$, and the next node is drawn from the first-order neighbors of node $v$ that are of type $A_{t+1}$.

Optionally, the structural features of the company nodes satisfy a co-occurrence probability maximization objective, in which $v_i$ is the current node and $v_j$ is a context node.

Optionally, screening the company nodes and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes includes:

transforming the node embeddings by a nonlinear transformation and, based on an attention vector $q$, obtaining the attention value of node $i$ in meta-path $Z$ as follows:

$$\omega^{i} = q^{T} \cdot \tanh\left(W \cdot (e^{i})^{T} + b\right),$$

where $W$ is a weight matrix, $b$ is a bias value, and $e^{i}$ is the embedded representation of node $i$;

normalizing, with an activation function, the attention value $\omega^{i}$ corresponding to node $i$ in meta-path $Z$ to obtain $\alpha^{i}$, the embedding importance value of node $i$;

collecting the importance values of all nodes in meta-path $Z$ into a weight matrix;

for the company-node representations $Z_C$ and $Z_{C'}$ obtained by screening the meta-path-based graph embeddings, and the company-node representation $Z_G$ obtained by screening a random walk over the patent heterogeneous information network, the embedding matrix of the company nodes is:

$$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G,$$

where $\alpha_C$ is the learned weight matrix of all nodes in $Z_C$, $\alpha_{C'}$ is the learned weight matrix of all nodes in $Z_{C'}$, and $\alpha_G$ is the learned weight matrix of all nodes in $Z_G$.

Optionally, in obtaining the embedding matrix of the company nodes, a loss function is used in which $E^{+}$ is a set of node pairs, $E$ is the set of all node pairs, and $t$ is a hyperparameter.

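Optionally, the cosine similarity between company nodes is calculated according to the embedding matrix by the following formula:

$$\cos(c_i, c_j) = \frac{c_i \cdot c_j}{\lVert c_i \rVert \, \lVert c_j \rVert},$$

where $c_i$ and $c_j$ are the embedding vectors of company nodes in the embedding matrix.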

A second aspect of the present application proposes a device for mining competitors based on a patent heterogeneous information network, comprising:

an acquisition module, which acquires ground-truth competitor pairs, and extracts and cleans patent data from a patent database according to the ground-truth competitor pairs to construct a patent data set;

a construction module, which determines patent semantic similarity according to the patent data set, constructs patent semantic edges according to the patent semantic similarity, and constructs a patent heterogeneous information network according to the patent semantic edges;

an embedding module, which obtains the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method;

an optimization module, which screens the company nodes and fuses the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes;

and a competitor mining module, which calculates the cosine similarity among the company nodes according to the embedding matrix, wherein the companies with the highest cosine similarity are taken as candidate competitors of the target company.

The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:

A network is constructed from patent data, and a patent heterogeneous information network centered on patents and companies is built on the semantic edges for competitor mining. The rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for use in competitor mining, thereby improving the performance and effectiveness of competitor mining.

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flowchart illustrating a method for mining competitors based on a patent heterogeneous information network according to an exemplary embodiment of the present application;

FIG. 2 is a block diagram illustrating a device for mining competitors based on a patent heterogeneous information network according to an exemplary embodiment of the present application;

FIG. 3 is a block diagram of an electronic device.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.

FIG. 1 is a flowchart illustrating a method for mining competitors based on a patent heterogeneous information network, according to an exemplary embodiment of the present application, including:

Step 101, acquiring ground-truth competitor pairs, and extracting and cleaning patent data from the patent database according to the ground-truth competitor pairs to construct a patent data set.

In the embodiment of the application, crawler technology is used to acquire the ground-truth competitor pairs, and the patent databases include Chinese and English databases.

For the Chinese data set, company name information related to patent infringement cases is collected from China Judgments Online (the national court judgment document website). Companies involved in a patent infringement dispute are considered to be in a competitive relationship. For example, if company A and company B are parties to a dispute over infringement of an invention patent, then company A and company B have a competitive relationship. The company name information is organized into a competitor list, in which each row is one competition pair.

In one possible embodiment, patent infringement case titles from January 2017 to July 2021 are collected, and the competitor list is divided into two sub-lists with 2020 as the dividing line.

For the English data set, for example, lists of company names for all industries are collected from the Yahoo! Finance website and used as a query list to obtain the competitor list of each target company. Nearly one hundred thousand patents are extracted from a patent database according to the competitor list. To ensure data balance and reliable experimental results, the number of patents owned by each company is controlled to about 20, forming the Uspto-Yahoo data set.

Cleaning the data set includes extracting data from the patent database according to the list of ground-truth competitor pairs and filtering out a pair if patent data cannot be found in the patent database for one or both of its companies. The patent data is then de-duplicated, and null values and invalid data are filtered out.
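The cleaning steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record fields `id`, `company`, and `text`, the function name `clean_dataset`, and the rule that a record is invalid when `id` or `text` is empty are all hypothetical and are not taken from the application.

```python
from typing import Dict, List, Optional, Tuple

Patent = Dict[str, Optional[str]]  # hypothetical record layout: id, company, text
Pair = Tuple[str, str]


def clean_dataset(pairs: List[Pair], patents: List[Patent]) -> Tuple[List[Pair], List[Patent]]:
    """Filter ground-truth pairs, then de-duplicate and drop invalid patent records."""
    # Companies for which at least one patent record exists in the database.
    companies_found = {p["company"] for p in patents if p.get("company")}
    # Drop a pair if patent data is missing for one or both of its companies.
    kept_pairs = [(a, b) for a, b in pairs if a in companies_found and b in companies_found]
    kept_companies = {c for pair in kept_pairs for c in pair}

    seen_ids = set()
    kept_patents: List[Patent] = []
    for p in patents:
        if p.get("company") not in kept_companies:
            continue
        # Null or invalid record: a required field is missing or empty (assumed rule).
        if not p.get("id") or not p.get("text"):
            continue
        # Duplicate record: the same patent id was already kept.
        if p["id"] in seen_ids:
            continue
        seen_ids.add(p["id"])
        kept_patents.append(p)
    return kept_pairs, kept_patents
```

Filtering the pairs first keeps the later de-duplication restricted to companies that actually appear in a usable competition pair.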

In one possible embodiment, the competitor of "the center communication stock limited" is "digital tech company in the united states".

Step 102, determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges.

In the embodiment of the application, the patent title, abstract, and claim specification are obtained from the patent data set to determine the semantic similarity of the patents. At the same time, companies, patents, inventors, and fields are associated with one another, and a patent heterogeneous information network centered on companies and patents is constructed.

Specifically, for the document content set $T = \{tc_1, tc_2, \ldots, tc_{|D|}\}$, the weight of a word $w$ in document content $tc_i$ is computed based on the term frequency-inverse document frequency (TF-IDF) index, according to the following formula:

$$\mathrm{TFIDF}(w, tc_i) = tf(w, tc_i) \times \log\frac{|D|}{m_w},$$

where $|D|$ is the total number of documents, $m_w$ is the number of documents in which the word $w$ appears at least once, and $tf(w, tc_i)$ is the number of times the word $w$ appears in document content $tc_i$.

The text feature vector of document content $tc_i$ is obtained according to the weights, where each word $w_i$ is represented by an N-dimensional word vector; the text feature vector of each patent is also N-dimensional.
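A minimal sketch of this weighting, assuming the classic TF-IDF weight $tf(w, tc_i) \cdot \log(|D| / m_w)$ and an unnormalized weighted sum of word vectors as the document feature vector. The aggregation rule in particular is an assumption, and the function names and the `word_vecs` lookup table are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight of each word w in each tokenized document tc_i: tf(w, tc_i) * log(|D| / m_w)."""
    n_docs = len(docs)  # |D|
    doc_freq = Counter(w for doc in docs for w in set(doc))  # m_w
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(w, tc_i)
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights


def doc_vectors(docs: List[List[str]], word_vecs: Dict[str, List[float]], dim: int) -> List[List[float]]:
    """N-dimensional document vectors as TF-IDF-weighted sums of N-dimensional word vectors."""
    vectors = []
    for weights in tfidf_weights(docs):
        vec = [0.0] * dim
        for word, weight in weights.items():
            # Words without a known vector contribute nothing (assumed behavior).
            for k, x in enumerate(word_vecs.get(word, [0.0] * dim)):
                vec[k] += weight * x
        vectors.append(vec)
    return vectors
```

A word that occurs in every document gets weight zero under this form, so only discriminative words shape the document vector.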

The similarity of the text content of patents $tc_i$ and $tc_j$ in $T$ is calculated, and the patent semantic similarity ranking is calculated according to the following formula:

$$p_i^{p} = \mathrm{Top}_n(\mathrm{Sim}_1(p_i, p_j)),$$

where $tsim(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary indicator, $\lambda$ is a weighting coefficient for patents $tc_i$ and $tc_j$, and $\mathrm{Top}_n(\cdot)$ is used to obtain the list of the $n$ patents with the highest similarity.

The binary indicator takes the value $\psi(p_i, p_j) = 1$ when its condition is satisfied and $\psi(p_i, p_j) = 0$ otherwise; $\lambda \in [0, 1]$ expresses the degree of correlation of the two terms in the formula.

In a possible embodiment, $\lambda = 0.2$ is set.

The patent semantic edges $PL = \{p_1^{p}, p_2^{p}, \ldots, p_{|D|}^{p}\}$ are constructed according to the similarity ranking.
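The Top-n ranking step can be illustrated with the sketch below. Because the exact form of $\mathrm{Sim}_1$ is not reproduced in the text, the scoring function is passed in as a parameter; cosine similarity is used here only as a stand-in for $tsim$, and the $\psi$ and $\lambda$ terms are omitted. The names `semantic_edges` and `cosine` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_edges(
    vectors: Dict[str, Vector],
    n: int,
    sim: Callable[[Vector, Vector], float] = cosine,
) -> Dict[str, List[str]]:
    """For each patent p_i, return its n most similar other patents in rank order."""
    edges = {}
    for p_i, v_i in vectors.items():
        scored = [(sim(v_i, v_j), p_j) for p_j, v_j in vectors.items() if p_j != p_i]
        # Highest similarity first; ties are broken by patent id for determinism.
        scored.sort(key=lambda s: (-s[0], s[1]))
        edges[p_i] = [p_j for _, p_j in scored[:n]]
    return edges
```

The position of a patent in each returned list is the rank that an edge-weighting scheme can later consume.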

After the patent semantic edges are obtained, a patent heterogeneous information network centered on companies and patents is constructed, as described below:

for a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where $V$, $E$, and $H$ denote the node set, edge set, and weight set respectively, $\zeta$ is the object type mapping function, $\psi$ is the relation type mapping function, and $\mu$ is the weight type mapping function, the weights of the relationships between nodes are defined as follows:

if the relationship between two nodes is a patent semantic edge, its edge weight is $g - u \times rank_{index}$, where $u$ is the weight attenuation value and $rank_{index}$ is the index of the edge in the patent semantic similarity ranking.

In a possible embodiment, $g = 0.1$ and $u = g/n$ are set.
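As a small worked example of the semantic edge weight $g - u \times rank_{index}$ with $u = g/n$, the sketch below assumes that the rank index starts at 0 for the most similar patent; the text does not state the starting index, and the function name is hypothetical.

```python
def semantic_edge_weight(rank_index: int, n: int, g: float = 0.1) -> float:
    """Weight of a patent semantic edge at the given rank (0-based index assumed)."""
    if not 0 <= rank_index < n:
        raise ValueError("rank_index must lie in [0, n)")
    u = g / n  # weight attenuation value from the embodiment
    return g - u * rank_index
```

With $n = 3$ and $g = 0.1$, the three semantic edges of a patent receive linearly decreasing weights of $0.1$, $0.1 \times 2/3$, and $0.1 \times 1/3$ under this indexing assumption.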

Step 103, obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method.

In the embodiment of the application, a meta-path $P: A_1 \rightarrow A_2 \rightarrow \cdots \rightarrow A_{l+1}$ is selected in a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$. The meta-path walk follows a transition distribution in which $n_t$ is the $t$-th node traversed in the random walk, the node $v$ is of type $A_t$, and the next node is drawn from the first-order neighbors of node $v$ that are of type $A_{t+1}$.
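For reference, a commonly used form of a meta-path-guided transition distribution (as in metapath2vec-style walks) is sketched below. It is offered as an assumption consistent with the definitions above, not as the formula of the application: the symbol $N_{A_{t+1}}(v)$ for the typed neighbor set is introduced here for illustration, and a weighted network could instead use probabilities proportional to edge weights.

```latex
p\left(n_{t+1} = x \mid n_t = v, P\right) =
\begin{cases}
\dfrac{1}{\left|N_{A_{t+1}}(v)\right|}, & (v, x) \in E \ \text{and}\ \zeta(x) = A_{t+1}, \\[2ex]
0, & \text{otherwise.}
\end{cases}
```

Under this form the walker chooses uniformly among the neighbors whose type matches the next position of the meta-path and never steps to a node of any other type.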

In this embodiment, patent nodes are denoted by the symbol P, company nodes by C, inventor nodes by I, and field nodes by F. Given the meta-paths CPC and CPIPC, two sample node sequences can be generated, for example starting from the Apple company node: (1) Apple_company → P1_patent → Qualcomm_company, and (2) Apple_company → P2_patent → Mary_inventor → P3_patent → Huawei_company. Likewise, given the meta-path CFC, another node sequence can be generated: Apple_company → G06K_CPC → Huawei_company.
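The node sequences above can be reproduced with a short meta-path-guided walk. The sketch assumes an unweighted, undirected toy graph and uniform sampling among type-matching neighbors; the function name `metapath_walk` and the adjacency and type dictionaries are hypothetical.

```python
import random
from typing import Dict, List, Optional


def metapath_walk(
    adj: Dict[str, List[str]],
    node_type: Dict[str, str],
    start: str,
    metapath: str,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Walk from `start`, choosing at each step a neighbor of the next meta-path type."""
    rng = rng or random.Random()
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    walk = [start]
    for next_type in metapath[1:]:
        candidates = [x for x in adj.get(walk[-1], []) if node_type[x] == next_type]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


# Toy network built from the example: C = company, P = patent, I = inventor, F = field.
NODE_TYPE = {"Apple": "C", "Qualcomm": "C", "Huawei": "C",
             "P1": "P", "P2": "P", "P3": "P", "Mary": "I", "G06K": "F"}
ADJ = {
    "Apple": ["P1", "P2", "G06K"], "Qualcomm": ["P1"], "Huawei": ["P3", "G06K"],
    "P1": ["Apple", "Qualcomm"], "P2": ["Apple", "Mary"], "P3": ["Mary", "Huawei"],
    "Mary": ["P2", "P3"], "G06K": ["Apple", "Huawei"],
}
```

Calling `metapath_walk(ADJ, NODE_TYPE, "Apple", "CPC")` yields a company, patent, company sequence; a walk may return to the start company, which a practical implementation might choose to filter.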

The structural features of the company nodes satisfy a co-occurrence probability maximization objective, in which $v_i$ is the current node and $v_j$ is a context node.

In a possible embodiment, for network embedding, the number of random walk paths is set to num = 10, the walk length to l = 10, the window size to w = 5, and the number of negative samples to n = 5.
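A standard skip-gram objective that maximizes node co-occurrence probability is sketched below for orientation. It is an assumed form, not the formula of the application: the context set $C(v_i)$ (nodes within the window of $v_i$ on a walk), the embedding vectors $x_v$, and the parameters $\theta$ are notation introduced here.

```latex
\max_{\theta} \sum_{v_i \in V} \; \sum_{v_j \in C(v_i)} \log p\left(v_j \mid v_i; \theta\right),
\qquad
p\left(v_j \mid v_i; \theta\right) = \frac{\exp\left(x_{v_j} \cdot x_{v_i}\right)}{\sum_{u \in V} \exp\left(x_{u} \cdot x_{v_i}\right)}
```

In practice the softmax denominator is usually approximated with negative sampling, which is consistent with the negative-sample count listed among the hyperparameters.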

Step 104, screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes.

In the embodiment of the application, the node embeddings are transformed by a nonlinear transformation, and the attention value of node $i$ in meta-path $Z$ is obtained based on an attention vector $q$:

$$\omega^{i} = q^{T} \cdot \tanh\left(W \cdot (e^{i})^{T} + b\right),$$

where $W$ is a weight matrix, $b$ is a bias value, and $e^{i}$ is the embedded representation of node $i$.

An activation function is used to normalize the attention value $\omega^{i}$ corresponding to node $i$ in meta-path $Z$, yielding $\alpha^{i}$, the embedding importance value of node $i$. The larger the value of $\alpha^{i}$, the higher the embedding importance of the corresponding node.

The importance values of all nodes in meta-path $Z$ are collected into a weight matrix.

For the company-node representations $Z_C$ and $Z_{C'}$ obtained by screening the meta-path-based graph embeddings, and the company-node representation $Z_G$ obtained by screening a random walk over the patent heterogeneous information network, the embedding matrix of the company nodes is:

$$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G,$$

where $\alpha_C$ is the learned weight matrix of all nodes in $Z_C$, $\alpha_{C'}$ is the learned weight matrix of all nodes in $Z_{C'}$, and $\alpha_G$ is the learned weight matrix of all nodes in $Z_G$.

In addition, in obtaining the embedding matrix of the company nodes, a loss function is used in which $E^{+}$ is a set of node pairs, $E$ is the set of all node pairs, and $t$ is a hyperparameter.

In a possible embodiment, the number of semantic edges is set to n = 3, the embedding dimension to d = 400, the learning rate to 0.0001, and the maximum number of iterations to epoch = 110.
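The attention-based fusion can be sketched as follows. Several choices are assumptions rather than details from the application: the normalization is taken to be a softmax over the three representations of the same node, the per-representation weight matrices are taken to be diagonal (one weight per node), and the bias `b` is treated as a vector. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def attention_score(e: Sequence[float], W: Matrix, b: Sequence[float], q: Sequence[float]) -> float:
    """omega = q^T . tanh(W . e^T + b) for one node embedding e."""
    hidden = [
        math.tanh(sum(w_rk * e_k for w_rk, e_k in zip(row, e)) + b_r)
        for row, b_r in zip(W, b)
    ]
    return sum(q_r * h_r for q_r, h_r in zip(q, hidden))


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def fuse(views: Sequence[Matrix], W: Matrix, b: Sequence[float], q: Sequence[float]) -> Matrix:
    """Z_A: per-node attention-weighted sum of the views (e.g. Z_C, Z_C', Z_G)."""
    fused = []
    for i in range(len(views[0])):
        # One attention value per view for node i, normalized across views (assumption).
        alphas = softmax([attention_score(view[i], W, b, q) for view in views])
        dim = len(views[0][i])
        fused.append([sum(a * view[i][k] for a, view in zip(alphas, views)) for k in range(dim)])
    return fused
```

In training, `W`, `b`, and `q` would be learned jointly with the loss; here they are plain inputs so the arithmetic of the fusion step is visible.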

Step 105, calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the companies with the highest cosine similarity are taken as candidate competitors of the target company.

In the embodiment of the application, the cosine similarity between company nodes is calculated according to the embedding matrix by the following formula:

$$\cos(c_i, c_j) = \frac{c_i \cdot c_j}{\lVert c_i \rVert \, \lVert c_j \rVert},$$

where $c_i$ and $c_j$ are the embedding vectors of company nodes in the embedding matrix.

Specifically, the k most likely competitors are recommended to the target company, where k is a parameter specifying the number of recommended competitors.

In a possible embodiment, k is set to 3, 10, 20, or 30.
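The recommendation step amounts to a top-k search by cosine similarity over the company embeddings. The sketch below is illustrative only; the function names and the toy embeddings (including the made-up company "Acme") are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_competitors(embeddings: Dict[str, Sequence[float]], target: str, k: int) -> List[str]:
    """Return the k companies whose embeddings are most similar to the target's."""
    scored = [
        (cosine(embeddings[target], vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]


# Toy embeddings for illustration only.
EMB = {"Apple": [1.0, 0.2], "Huawei": [0.9, 0.3], "Qualcomm": [0.5, 0.5], "Acme": [-1.0, 0.0]}
```

For example, `top_k_competitors(EMB, "Apple", 2)` ranks the two companies whose vectors point in the direction closest to Apple's.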

According to the embodiment of the application, a network is constructed from patent data, and a patent heterogeneous information network centered on patents and companies is built on the semantic edges for competitor mining. The rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for use in competitor mining, thereby improving the performance and effectiveness of competitor mining.

In addition, the patent data set covers both Chinese and English data and is therefore applicable to different language scenarios.

FIG. 2 is a block diagram of a device 200 for mining competitors based on a patent heterogeneous information network, according to an exemplary embodiment of the present application, comprising an acquisition module 210, a construction module 220, an embedding module 230, an optimization module 240, and a competitor mining module 250.

The acquisition module 210 acquires ground-truth competitor pairs, and extracts and cleans patent data from the patent database according to the ground-truth competitor pairs to construct a patent data set;

the construction module 220 determines patent semantic similarity according to the patent data set, constructs patent semantic edges according to the patent semantic similarity, and constructs a patent heterogeneous information network according to the patent semantic edges;

the embedding module 230 obtains the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method;

the optimization module 240 screens the company nodes and fuses the vector representations of the company nodes through the attention mechanism to obtain an embedding matrix of the company nodes;

the competitor mining module 250 calculates the cosine similarity between company nodes according to the embedding matrix, wherein the companies with the highest cosine similarity are taken as candidate competitors of the target company.

With regard to the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be repeated here.

FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 3, the apparatus 300 includes a computing unit 301 that may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 may also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other by a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.

Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 301 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 301 performs the methods and processes described above, such as the method for mining competitors. For example, in some embodiments, the method for mining competitors may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the method for mining competitors described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the method for mining competitors in any other suitable way (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual Private Server" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be appreciated that the various forms of flow shown above may be used with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. A method for mining competitors based on a patent heterogeneous information network, comprising:

2. The method of claim 1, wherein the extracting and cleansing patent data from the patent database to construct a patent data set comprises:

3. The method of claim 1, wherein the determining patent semantic similarity from the patent dataset, and constructing patent semantic edges from the patent semantic similarity, comprises:

wherein

is the N-dimensional word vector of word $w_i$;

$p_i^{p} = \mathrm{Topn}(\mathrm{Sim}_1(p_i, p_j))$,

wherein $\mathrm{tsim}(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary indicator, $\lambda$ relates to $tc_i$ and $tc_j$ of the patents, and $\mathrm{Topn}$ obtains the list of the first n patents with the highest similarity;
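The Topn selection described above can be illustrated with a short Python sketch that keeps, for each patent, the n other patents with the highest similarity score. It is a sketch under assumptions, not the claimed method: the similarity function is passed in as a parameter, and the names `top_n_similar`, `sim`, and the toy score table are hypothetical.

```python
def top_n_similar(patent_ids, sim, n):
    """For each patent, list the n other patents with the highest sim score."""
    result = {}
    for p_i in patent_ids:
        others = [p_j for p_j in patent_ids if p_j != p_i]
        # Sort by descending similarity; ties are broken by id for determinism.
        others.sort(key=lambda p_j: (-sim(p_i, p_j), p_j))
        result[p_i] = others[:n]
    return result


# Toy symmetric similarity table (hypothetical values, not patent data).
scores = {("p1", "p2"): 0.9, ("p1", "p3"): 0.2, ("p2", "p3"): 0.4}


def sim(a, b):
    """Look up the similarity of an unordered patent pair."""
    return scores[(a, b)] if (a, b) in scores else scores[(b, a)]


# Each patent is linked to its single most similar patent (n = 1).
semantic_edges = top_n_similar(["p1", "p2", "p3"], sim, n=1)
```

Each entry of `semantic_edges` could then serve as a candidate list of semantic connected edges for one patent.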

4. A method according to claim 3, wherein said constructing a patent heterogeneous information network from said patent semantic links comprises:

5. The method according to claim 1, wherein the obtaining, by a graph embedding method, the company node in the patent heterogeneous information network and the structural feature of the company node, includes:

is the set of first-order neighbors of node $v$ that are of type $A_{t+1}$.
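A neighbor set restricted to one node type is what a meta-path guided random walk samples from. The following Python sketch assumes uniform sampling among the first-order neighbors of the required type and a symmetric meta-path that repeats cyclically; the exact transition formula of the claim is not restated here, and the graph, node types, and the name `metapath_walk` are hypothetical.

```python
import random


def metapath_walk(adj, node_type, start, metapath, walk_length, rng):
    """Walk from `start`, choosing uniformly at each step among first-order
    neighbors whose type is the next type on the cyclic meta-path."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    walk = [start]
    # A symmetric meta-path such as C-P-C repeats with period len(metapath) - 1.
    period = len(metapath) - 1
    for step in range(1, walk_length):
        wanted = metapath[step % period]
        candidates = [u for u in adj[walk[-1]] if node_type[u] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
    return walk


# Toy heterogeneous network (hypothetical): companies c*, patents p*.
node_type = {"c1": "C", "c2": "C", "p1": "P", "p2": "P"}
adj = {
    "c1": ["p1"],
    "c2": ["p1", "p2"],
    "p1": ["c1", "c2", "p2"],  # p1-p2 stands for a semantic edge between patents
    "p2": ["c2", "p1"],
}
walk = metapath_walk(adj, node_type, "c1", ["C", "P", "C"], 5, random.Random(0))
```

Walks generated this way are the kind of node sequences that a skip-gram style embedding model can then be trained on.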

6. The method of claim 1, wherein the structural feature of the company node satisfies a co-occurrence probability maximization formula, expressed as:

wherein $v_i$ is the current node and $v_j$ is a context node.
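For orientation only, a common form of a co-occurrence maximization objective in skip-gram style graph embedding is given below. It is an assumption offered for illustration and may differ from the formula of this claim; the symbols $\theta$ (model parameters), $N(v_i)$ (context nodes of $v_i$ collected from walks), $V$ (node set), and $x_v$ (embedding of node $v$) are not taken from the claims.

```latex
\max_{\theta} \sum_{v_i \in V} \sum_{v_j \in N(v_i)} \log p(v_j \mid v_i; \theta),
\qquad
p(v_j \mid v_i; \theta) =
  \frac{\exp(x_{v_j} \cdot x_{v_i})}{\sum_{u \in V} \exp(x_{u} \cdot x_{v_i})}
```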

7. The method of claim 1, wherein said filtering said company nodes, synthesizing vector representations of said company nodes by an attention mechanism, to obtain an embedding matrix of said company nodes, comprises:

$\omega^{i} = q^{T} \cdot \tanh(W \cdot (e^{i})^{T} + b)$,

wherein $\alpha^{i}$ is the importance value of the embedding of node $i$;

for all nodes in meta-path Z, the weight matrix is

wherein $Z_C$, $Z_{C'}$, and $Z_G$ are the meta-path embeddings of the company nodes, obtained by screening through the graph embedding method and by random walks over the patent heterogeneous information network; the embedding matrix of the company nodes is:

$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G$,
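The weighted fusion above can be sketched in Python as follows. The sketch assumes that the score of each embedding matrix is the mean of $q^{T} \cdot \tanh(W \cdot e + b)$ over its rows and that the three weights come from a softmax over those scores, as in common semantic-level attention; both steps are assumptions, as are the names `attention_fuse`, `score`, and all toy values.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def score(Z, W, b, q):
    """Mean over the rows e of Z of q^T . tanh(W . e + b)."""
    total = 0.0
    for e in Z:
        hidden = [math.tanh(h + bi) for h, bi in zip(matvec(W, e), b)]
        total += sum(qi * hi for qi, hi in zip(q, hidden))
    return total / len(Z)


def attention_fuse(matrices, W, b, q):
    """Return (alphas, fused) with fused = sum_k alphas[k] * matrices[k]."""
    scores = [score(Z, W, b, q) for Z in matrices]
    m = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    norm = sum(exps)
    alphas = [x / norm for x in exps]
    n_nodes, dim = len(matrices[0]), len(matrices[0][0])
    fused = [
        [sum(a * Z[i][j] for a, Z in zip(alphas, matrices)) for j in range(dim)]
        for i in range(n_nodes)
    ]
    return alphas, fused


# Toy inputs (hypothetical): 2 company nodes with 2-dimensional embeddings.
Z_C = [[1.0, 0.0], [0.0, 1.0]]
Z_C2 = [[0.5, 0.5], [0.5, 0.5]]  # stands in for Z_C'
Z_G = [[0.0, 1.0], [1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]     # learned in practice; fixed here
b = [0.0, 0.0]
q = [1.0, 1.0]
alphas, Z_A = attention_fuse([Z_C, Z_C2, Z_G], W, b, q)
```

In a trained model, `W`, `b`, and `q` would be learned parameters rather than fixed constants.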

8. The method of claim 7, wherein in said obtaining said embedding matrix of the company nodes, a loss function is:

wherein $E^{+}$ and $E$ are sets of node pairs, $E$ being the set of all node pairs, and $t$ is a hyperparameter.

9. The method of claim 1, wherein the cosine similarity between the company nodes is calculated from the embedding matrix by the formula:

wherein $c_i$ and $c_j$ are company nodes in the embedding matrix.
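The ranking step can be illustrated with the standard definition of cosine similarity, the dot product of two vectors divided by the product of their Euclidean norms, which is assumed here to be the intended formula. In the Python sketch below, the names `cosine_similarity` and `rank_competitors` and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for all-zero vectors
    return dot / (norm_u * norm_v)


def rank_competitors(embeddings, target, top_k):
    """Return the top_k (company, similarity) pairs most similar to target."""
    scored = [
        (name, cosine_similarity(embeddings[target], vec))
        for name, vec in embeddings.items()
        if name != target
    ]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Toy company embeddings (hypothetical rows of an embedding matrix).
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.8, 0.6],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
ranked = rank_competitors(embeddings, "A", 2)
```

The first entry of `ranked` corresponds to the candidate competitor with the highest cosine similarity to the target company.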

10. An apparatus for mining competitors based on patent heterogeneous information networks, comprising: