CN115641009B - Method and device for excavating competitors based on patent heterogeneous information network - Google Patents

Info

Publication number
CN115641009B
Authority
CN
China
Prior art keywords
company
nodes
node
semantic
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211421320.5A
Other languages
Chinese (zh)
Other versions
CN115641009A (en)
Inventor
陈洁
张雪
赵姝
张燕平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and a device for mining competitors based on a patent heterogeneous information network, which relate to the field of network representation learning. The method comprises the following steps: acquiring a competition pair true value, and extracting and cleaning patent data from a patent database according to the competition pair true value to construct a patent data set; determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges; obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes through graph embedding; screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes; and calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the results with the highest cosine similarity are taken as candidate competitors of the target company. In this method, a network is constructed from patent data, and a graph embedding method and an attention mechanism are introduced for competitor mining, which improves mining efficiency.

Description

Method and device for excavating competitors based on patent heterogeneous information network
Technical Field
The application relates to the technical field of network representation learning, and in particular to a method and a device for mining competitors based on a patent heterogeneous information network.
Background
Patents are an important measure of the level of scientific and technological development, and are the focus and core of scientific, technological and economic competition. Patent competitor mining is an emerging research field that aims to help industrial, business, legal and decision-making groups discover important competitive relationships and identify leading business trends, thereby motivating novel industrial solutions and informing important investment decisions. In today's knowledge economy, competition over intellectual property is unavoidable. It is therefore of great significance to mine competitors for target enterprises based on patent data.
Most current competitor mining methods are based on text data and ignore structural information. A small number of studies combine structural information with text information for competitor mining, but such methods do not adequately account for the heterogeneity of patent data and have difficulty efficiently mining latent structural features between companies. Although existing graph embedding methods have been used to mine competitive relationships among enterprises from patent data, no work has proposed a competitor mining scheme designed specifically for patent data. As a result, existing competitor mining techniques often perform below expectations.
Disclosure of Invention
To solve the above problems, a method and a device for mining competitors based on a patent heterogeneous information network are provided. A network is constructed from patent data, the rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for competitor mining, thereby improving competitor mining performance.
The first aspect of the present application proposes a method for mining competitors based on a patent heterogeneous information network, comprising:
acquiring a competition pair true value, and extracting and cleaning patent data from a patent database according to the competition pair true value to construct a patent data set;
determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges;
obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes by a graph embedding method;
screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes;
and calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the results with the highest cosine similarity are taken as candidate competitors of the target company.
Optionally, extracting and cleaning patent data from the patent database to construct a patent data set includes:
obtaining the competition pair true value by using crawler technology, de-duplicating the patent data, and filtering out null values and invalid data.
Optionally, determining the patent semantic similarity according to the patent data set and constructing the patent semantic edges according to the patent semantic similarity includes:
for a document content set $T = \{tc_1, tc_2, \ldots, tc_{|D|}\}$, computing the weight of word $w$ in document content $tc_i$ based on the term frequency-inverse document frequency (TF-IDF) index, according to the following formula:
Figure SMS_1
wherein $tc_i$ is the text content of patent $p_i$,
Figure SMS_2
$|D|$ is the total number of documents, $m_w$ is the number of documents in which the word $w$ appears at least once, and $tf(w, tc_i)$ is the number of times the word $w$ appears in the document content $tc_i$;
obtaining a vector representation of the document content $tc_i$ according to the weights, by the following formula:
Figure SMS_3
wherein
Figure SMS_4
is the N-dimensional word vector of word $w_i$;
calculating the similarity of the text content of patents $tc_i$ and $tc_j$ in $T$, and calculating the patent semantic similarity ranking according to the following formulas:
Figure SMS_5
$p_i p = \mathrm{Topn}(\mathrm{Sim}_1(p_i, p_j))$,
wherein $tsim(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary index, $\lambda$ is a parameter associated with patents $tc_i$ and $tc_j$, and $\mathrm{Topn}(\cdot)$ is used to obtain the list of the top n most similar patents;
constructing the patent semantic edges $PL = \{p_1 p, p_2 p, \ldots, p_{|D|} p\}$ according to the similarity ranking.
Optionally, constructing the patent heterogeneous information network according to the patent semantic edges includes:
for a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where $V$, $E$ and $H$ represent the node, edge and weight sets respectively, $\zeta$ is the object type mapping function, $\psi$ is the relation type mapping function and $\mu$ is the weight type mapping function, defining the weights of the relations between the nodes as follows:
Figure SMS_6
where $u$ is the weight attenuation value and $rank_{index}$ is the index in the patent semantic similarity ranking; if the relation between the nodes is a patent semantic edge, its edge weight is $g - u \times rank_{index}$.
Optionally, obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes by a graph embedding method includes:
selecting a meta-path $P: A_1 \to A_2 \to \cdots \to A_{l+1}$ in a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, wherein the meta-path walk follows the following distribution:
Figure SMS_7
wherein $n_t$ is the $t$-th node traversed in the random walk, $v$ is of type $A_t$, and
Figure SMS_8
is the set of first-order neighbors of node $v$ that are of type $A_{t+1}$.
Optionally, the structural features of the company nodes satisfy a co-occurrence probability maximization formula, expressed as:
Figure SMS_9
wherein $v_i$ is the current node and $v_j$ is a context node.
Optionally, screening the company nodes and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes includes:
transforming the node embeddings by a nonlinear transformation and, based on the attention vector
Figure SMS_10
obtaining the attention value of node $i$ in meta-path $Z$, formulated as follows:
$\omega_i = q^T \cdot \tanh(W \cdot (e_i)^T + b)$,
wherein $W$ is a weight matrix, $b$ is a bias value, and $e_i$ is the embedded representation;
normalizing the attention value $\omega_i$ corresponding to node $i$ in the meta-path $Z$ using an activation function, by the following formula:
Figure SMS_11
wherein $\alpha_i$ is the embedding importance value of node $i$;
for all nodes in meta-path $Z$, the weight matrix is
Figure SMS_12
for the meta-paths $Z_C$ and $Z_{C'}$ of company nodes obtained by screening through the graph embedding method, and the meta-path $Z_G$ of company nodes obtained by screening through random walks over the patent heterogeneous information network, the embedding matrix of the company nodes is as follows:
$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G$,
wherein $\alpha_C$ is the learned weight matrix of all nodes in meta-path $Z_C$, $\alpha_{C'}$ is the learned weight matrix of all nodes in meta-path $Z_{C'}$, and $\alpha_G$ is the learned weight matrix of all nodes in meta-path $Z_G$.
Optionally, in obtaining the embedding matrix of the company nodes, the loss function is:
Figure SMS_13
wherein ,E+ E is the total node pair, t is the hyper-parameter.
Optionally, the cosine similarity between the company nodes is calculated according to the embedding matrix by the following formula:
Figure SMS_14
wherein $c_i$ and $c_j$ are company nodes in the embedding matrix.
A second aspect of the present application proposes an apparatus for mining competitors based on a patent heterogeneous information network, comprising:
an acquisition module, which acquires a competition pair true value, and extracts and cleans patent data from a patent database according to the competition pair true value to construct a patent data set;
a construction module, which determines patent semantic similarity according to the patent data set, constructs patent semantic edges according to the patent semantic similarity, and constructs a patent heterogeneous information network according to the patent semantic edges;
an embedding module, which obtains the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method;
an optimization module, which screens the company nodes and fuses the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes;
and a competitor mining module, which calculates the cosine similarity among the company nodes according to the embedding matrix, wherein the results with the highest cosine similarity are taken as candidate competitors of the target company.
The technical scheme provided by the embodiments of the disclosure brings at least the following beneficial effects:
a network is constructed from patent data, and a patent heterogeneous information network centered on patents and companies is built on semantic edges for competitor mining. The rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for competitor mining, thereby improving both the performance and the effectiveness of competitor mining.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart illustrating a method for mining competitors based on a patent heterogeneous information network according to an exemplary embodiment of the present application;
FIG. 2 is a block diagram illustrating a device for mining competitors based on a patent heterogeneous information network according to an exemplary embodiment of the present application;
FIG. 3 is a block diagram of an electronic device.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present application and are not to be construed as limiting the present application.
FIG. 1 is a flowchart illustrating a method for mining competitors based on a patent heterogeneous information network, according to an exemplary embodiment of the present application, including:
Step 101, obtaining a competition pair true value, and extracting and cleaning patent data from the patent database according to the competition pair true value to construct a patent data set.
In the embodiment of the application, crawler technology is used to acquire the competition pair true value, and the patent database includes Chinese and English databases.
For the Chinese data set, the names of companies involved in patent infringement cases are collected from China Judgments Online (the national judgment documents website). Companies involved in a patent infringement dispute are considered to have a competitive relationship. For example, if company A and company B have a dispute over infringement of an invention patent, then company A and company B have a competitive relationship, and their names are recorded in the competitor list as one competition pair (one row of the competitor list).
In one possible embodiment, patent infringement case titles from January 2017 to July 2021 are collected, and the competitor list is divided into two sub-lists with 2020 as the dividing line.
For the English data set, for example, company name lists for all industries are collected from the Yahoo! Finance website and used as website query lists to obtain the competitor list of each target company. Nearly one hundred thousand patents are extracted from the patent database according to the competitor list and, to ensure data balance and reliable experimental results, the number of patents owned by each company is limited to about 20, forming the Uspto-Yahoo data set.
Cleaning the data set includes extracting data from the patent database according to the competition pair true value list, and filtering out a pair if patent data cannot be found in the patent database for one or both of its companies. The patent data is then de-duplicated, and null values and invalid data are filtered out.
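The cleaning step above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the record field names (`pub_no`, `company`, `title`, `abstract`) and the choice of the publication number as the de-duplication key are assumptions, and the sketch cleans the records before filtering the pairs for simplicity.

```python
from typing import Dict, List, Optional, Tuple

Patent = Dict[str, Optional[str]]

# Hypothetical field names; the source does not specify a record schema.
REQUIRED_FIELDS = ("pub_no", "company", "title", "abstract")


def clean_patents(patents: List[Patent]) -> List[Patent]:
    """Drop records with null or empty required fields, then de-duplicate."""
    seen = set()
    cleaned = []
    for patent in patents:
        # Filter null values and empty (invalid) fields.
        if any(not (patent.get(f) or "").strip() for f in REQUIRED_FIELDS):
            continue
        # De-duplicate on the publication number (assumed unique key).
        if patent["pub_no"] in seen:
            continue
        seen.add(patent["pub_no"])
        cleaned.append(patent)
    return cleaned


def filter_pairs(pairs: List[Tuple[str, str]],
                 patents: List[Patent]) -> List[Tuple[str, str]]:
    """Keep only competition pairs where both companies own a patent."""
    companies = {p["company"] for p in patents}
    return [(a, b) for a, b in pairs if a in companies and b in companies]
```

A pair is discarded as soon as either company has no surviving patent record, which matches the rule that a pair is filtered out when one or both companies cannot be found in the patent database.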
In one possible embodiment, the competitor of "the center communication stock limited" is "digital tech company in the united states".
Step 102, determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges.
In the embodiment of the application, the patent title, abstract and claim specification are obtained from the patent data set to determine the semantic similarity of the patents; meanwhile, companies, patents, inventors and fields are associated to construct a patent heterogeneous information network centered on companies and patents.
Specifically, for the document content set $T = \{tc_1, tc_2, \ldots, tc_{|D|}\}$, the weight of word $w$ in document content $tc_i$ is computed based on the term frequency-inverse document frequency (TF-IDF) index, according to the following formula:
Figure SMS_15
wherein $tc_i$ is the text content of patent $p_i$,
Figure SMS_16
$|D|$ is the total number of documents, $m_w$ is the number of documents in which the word $w$ appears at least once, and $tf(w, tc_i)$ is the number of times the word $w$ appears in the document content $tc_i$;
a vector representation of the document content $tc_i$ is obtained according to the weights, by the following formula:
Figure SMS_17
wherein
Figure SMS_18
is the N-dimensional word vector of word $w_i$, and each patent text feature vector
Figure SMS_19
is also N-dimensional.
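Since the formula images are not reproduced in this text, the following sketch only illustrates the idea under stated assumptions: the weight is taken to be the common variant $tf(w, tc_i) \cdot \log(|D| / m_w)$, and the patent text feature vector is taken to be the weighted sum of the word vectors. The function names and the `word_vectors` lookup table are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF weights, assuming tf * log(|D| / m_w)."""
    n_docs = len(docs)
    # m_w: number of documents in which word w appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(w, tc_i): occurrences of w in this document
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights


def document_vector(weights: Dict[str, float],
                    word_vectors: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Weighted sum of N-dimensional word vectors (assumed aggregation)."""
    vec = [0.0] * dim
    for word, weight in weights.items():
        word_vec = word_vectors.get(word)
        if word_vec is None:  # skip words without a pretrained vector
            continue
        for k in range(dim):
            vec[k] += weight * word_vec[k]
    return vec
```

Because every word vector is N-dimensional, the resulting document vector is also N-dimensional, consistent with the statement above.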
The similarity of the text content of patents $tc_i$ and $tc_j$ in $T$ is calculated, and the patent semantic similarity ranking is calculated according to the following formulas:
Figure SMS_20
$p_i p = \mathrm{Topn}(\mathrm{Sim}_1(p_i, p_j))$,
wherein $tsim(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary index, $\lambda$ is a parameter associated with patents $tc_i$ and $tc_j$, and $\mathrm{Topn}(\cdot)$ is used to obtain the list of the top n most similar patents.
$\psi(p_i, p_j) = 1$, otherwise $\psi(p_i, p_j) = 0$; $\lambda \in [0, 1]$ expresses the degree of correlation of the two terms in the formula.
In a possible embodiment, $\lambda = 0.2$ is set.
The patent semantic edges $PL = \{p_1 p, p_2 p, \ldots, p_{|D|} p\}$ are constructed according to the similarity ranking.
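The Top-n selection can be sketched as below. This is an illustration under assumptions: cosine similarity stands in for $tsim(\cdot,\cdot)$, and the $\psi$ and $\lambda$ term is omitted because its formula is not reproduced in this text. The function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, used here as an assumed stand-in for tsim."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_semantic_edges(vectors: Dict[str, List[float]],
                         n: int) -> Dict[str, List[str]]:
    """For each patent, list its n most similar other patents, best first."""
    edges = {}
    for pid, vec in vectors.items():
        scored = [(cosine(vec, other), oid)
                  for oid, other in vectors.items() if oid != pid]
        # Highest similarity first; ties broken by patent id for determinism.
        scored.sort(key=lambda item: (-item[0], item[1]))
        edges[pid] = [oid for _, oid in scored[:n]]
    return edges
```

The position of a patent in each returned list plays the role of the rank index used later when weighting the semantic edges.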
After the patent semantic edges are obtained, a patent heterogeneous information network centered on companies and patents is constructed, as described below:
for a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where $V$, $E$ and $H$ represent the node, edge and weight sets respectively, $\zeta$ is the object type mapping function, $\psi$ is the relation type mapping function and $\mu$ is the weight type mapping function, the weights of the relations between company nodes are defined as follows:
Figure SMS_21
where $u$ is the weight attenuation value and $rank_{index}$ is the index in the patent semantic similarity ranking; if the relation between nodes is a patent semantic edge, its edge weight is $g - u \times rank_{index}$.
In a possible embodiment, $g = 0.1$ and $u = g/n$ are set.
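As a worked illustration of the semantic edge weight $g - u \times rank_{index}$ with $g = 0.1$ and $u = g/n$, the sketch below assumes a 0-based rank index, so that the most similar patent receives the full weight $g$. The source does not state whether the index starts at 0 or 1, and the weights of the other relation types are not covered here.

```python
def semantic_edge_weight(rank_index: int, n: int, g: float = 0.1) -> float:
    """Weight of a patent semantic edge: g - u * rank_index, with u = g / n.

    rank_index is assumed to be 0-based (0 = most similar patent).
    """
    u = g / n  # weight attenuation value
    return g - u * rank_index


# With n = 3 semantic edges per patent, the weights decay linearly with rank.
weights = [semantic_edge_weight(rank, n=3) for rank in range(3)]
```

Under this assumption the three weights are approximately 0.1, 0.067 and 0.033, so lower-ranked semantic edges contribute less to the network.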
Step 103, obtaining the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method.
In the embodiment of the application, a meta-path $P: A_1 \to A_2 \to \cdots \to A_{l+1}$ is selected in a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, and the meta-path walk follows the following distribution:
Figure SMS_22
wherein $n_t$ is the $t$-th node traversed in the random walk, $v$ is of type $A_t$, and
Figure SMS_23
is the set of first-order neighbors of node $v$ that are of type $A_{t+1}$.
In this embodiment, patent nodes are denoted by the symbol P, company nodes by C, inventor nodes by I, and field nodes by F. Given the meta-paths CPC and CPIPC, two sample node sequences are generated starting, for example, from the Apple company node: (1) Apple (company) → P1 (patent) → Qualcomm (company), and (2) Apple (company) → P2 (patent) → Mary (inventor) → P3 (patent) → Huawei (company). Likewise, given the meta-path CFC, another node sequence may be generated: Apple (company) → G06K (CPC) → Huawei (company).
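A meta-path guided walk of this kind can be sketched as follows. The sketch assumes uniform sampling among the neighbors of the required type and ignores edge weights, because the walk distribution is given only in a formula image that is not reproduced here. The toy graph reuses the node names from the example above, and the function name is hypothetical.

```python
import random
from typing import Dict, List, Optional


def metapath_walk(start: str,
                  metapath: str,
                  neighbors: Dict[str, List[str]],
                  node_type: Dict[str, str],
                  rng: random.Random) -> Optional[List[str]]:
    """Walk from start, following the node types in metapath (e.g. "CPC").

    Returns None if the walk cannot be completed.
    """
    if node_type[start] != metapath[0]:
        return None
    walk = [start]
    for next_type in metapath[1:]:
        # First-order neighbors of the current node with the required type.
        candidates = [v for v in neighbors[walk[-1]]
                      if node_type[v] == next_type]
        if not candidates:
            return None
        walk.append(rng.choice(candidates))  # uniform choice (assumption)
    return walk


# Toy network built from the example sequences in the text.
node_type = {"Apple": "C", "Qualcomm": "C", "Huawei": "C",
             "P1": "P", "P2": "P", "P3": "P", "Mary": "I"}
edge_list = [("Apple", "P1"), ("P1", "Qualcomm"), ("Apple", "P2"),
             ("P2", "Mary"), ("Mary", "P3"), ("P3", "Huawei")]
neighbors: Dict[str, List[str]] = {v: [] for v in node_type}
for a, b in edge_list:
    neighbors[a].append(b)
    neighbors[b].append(a)

sample = metapath_walk("Apple", "CPC", neighbors, node_type, random.Random(0))
```

Each call yields one node sequence whose types spell out the meta-path; repeating the call from every company node produces the corpus of walks used for embedding.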
The structural features of the company nodes satisfy the co-occurrence probability maximization formula, expressed as:
Figure SMS_24
wherein $v_i$ is the current node and $v_j$ is a context node.
In a possible embodiment, for network embedding, the number of random walk paths is set to num = 10, the walk length to l = 10, the window size to w = 5, and the number of negative samples to n = 5.
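To make the roles of the current node $v_i$ and the context node $v_j$ concrete, the sketch below extracts (current, context) pairs from a single walk using a sliding window. It shows only how training pairs could be formed and is an assumption about the skip-gram style setup; the objective itself and the negative sampling are not implemented. The function name is hypothetical.

```python
from typing import List, Tuple


def context_pairs(walk: List[str], window: int = 5) -> List[Tuple[str, str]]:
    """Return (current node, context node) pairs within the given window."""
    pairs = []
    for i, current in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((current, walk[j]))
    return pairs
```

For example, `context_pairs(["Apple", "P1", "Qualcomm"], window=1)` pairs each node only with its immediate neighbors in the walk.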
Step 104, screening the company nodes, and fusing the vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes.
In the embodiment of the application, the node embeddings are transformed by a nonlinear transformation and, based on the attention vector
Figure SMS_25
the attention value of node $i$ in meta-path $Z$ is obtained, formulated as follows:
$\omega_i = q^T \cdot \tanh(W \cdot (e_i)^T + b)$,
wherein $W$ is a weight matrix, $b$ is a bias value, and $e_i$ is the embedded representation;
an activation function is used to normalize the attention value $\omega_i$ corresponding to node $i$ in the meta-path $Z$, by the following formula:
Figure SMS_26
wherein $\alpha_i$ is the embedding importance value of node $i$; the larger the value of
Figure SMS_27
the higher the embedding importance of its corresponding node.
For all nodes in meta-path $Z$, the weight matrix is
Figure SMS_28
For the meta-paths $Z_C$ and $Z_{C'}$ of company nodes obtained by screening through the graph embedding method, and the meta-path $Z_G$ of company nodes obtained by screening through random walks over the patent heterogeneous information network, the embedding matrix of the company nodes is:
$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G$,
wherein $\alpha_C$ is the learned weight matrix of all nodes in meta-path $Z_C$, $\alpha_{C'}$ is the learned weight matrix of all nodes in meta-path $Z_{C'}$, and $\alpha_G$ is the learned weight matrix of all nodes in meta-path $Z_G$.
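The fusion step can be sketched in plain Python as below. Several points are assumptions, since the normalization formula is not reproduced in this text: the activation function is taken to be a softmax, applied for each node across the three embedding views ($Z_C$, $Z_{C'}$, $Z_G$); the parameters `W`, `b` and `q` are passed in as fixed values rather than learned; and `b` is treated as a vector. The function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def attention_score(e: Vector, W: Matrix, b: Vector, q: Vector) -> float:
    """omega = q^T . tanh(W . e^T + b) for one node embedding e."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, e)) + bias)
              for row, bias in zip(W, b)]
    return sum(q_k * h_k for q_k, h_k in zip(q, hidden))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax (assumed normalization)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def fuse_views(views: Dict[str, Matrix], W: Matrix, b: Vector,
               q: Vector) -> Matrix:
    """Z_A[i] = sum over views of alpha_view[i] * Z_view[i]."""
    names = list(views)
    n_nodes = len(views[names[0]])
    fused = []
    for i in range(n_nodes):
        scores = [attention_score(views[name][i], W, b, q) for name in names]
        alphas = softmax(scores)  # per-node importance of each view
        dim = len(views[names[0]][i])
        fused.append([sum(a * views[name][i][k]
                          for a, name in zip(alphas, names))
                      for k in range(dim)])
    return fused
```

In a trained model `W`, `b` and `q` would be learned together with the loss function; here they only show how the attention values weight the three views before summation.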
In addition, in obtaining the embedding matrix of the company node, the loss function is:
Figure SMS_29
wherein ,E+ E is the total node pair, t is the hyper-parameter.
In a possible embodiment, the number of semantic edges is set to n = 3, the embedding dimension to d = 400, the learning rate to 0.0001, and the maximum number of iterations to epoch = 110.
Step 105, calculating the cosine similarity among the company nodes according to the embedding matrix, wherein the results with the highest cosine similarity are taken as candidate competitors of the target company.
In the embodiment of the application, the cosine similarity among the company nodes is calculated according to the embedding matrix by the following formula:
Figure SMS_30
wherein $c_i$ and $c_j$ are company nodes in the embedding matrix.
Specifically, the k most likely competitors are recommended to the target company, where k is a parameter specifying the number of recommended competitors.
In a possible embodiment, k is set to 3, 10, 20 or 30.
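The recommendation step can be sketched as below, using the standard definition of cosine similarity, $c_i \cdot c_j / (\lVert c_i \rVert \, \lVert c_j \rVert)$. The dictionary layout of the embeddings and the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Standard cosine similarity; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recommend_competitors(target: str,
                          embeddings: Dict[str, List[float]],
                          k: int) -> List[str]:
    """Return the k companies most similar to target, best first."""
    scored = [(cosine_similarity(embeddings[target], vec), name)
              for name, vec in embeddings.items() if name != target]
    # Highest similarity first; ties broken by name for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]
```

Calling `recommend_competitors(target, embeddings, k)` with k set to 3, 10, 20 or 30 yields candidate lists of the sizes mentioned above.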
According to the embodiment of the application, a network is constructed from patent data, and a patent heterogeneous information network centered on patents and companies is built on semantic edges for competitor mining. The rich structural and semantic information of the nodes in the network is learned using different graph embedding methods, and an attention mechanism is introduced to fuse the target node representations for competitor mining, thereby improving both the performance and the effectiveness of competitor mining.
In addition, the patent data set includes Chinese and English data sets, so the method is applicable to different language scenarios.
FIG. 2 is a block diagram of an apparatus 200 for mining competitors based on a patent heterogeneous information network, according to an exemplary embodiment of the present application, comprising an acquisition module 210, a construction module 220, an embedding module 230, an optimization module 240, and a competitor mining module 250.
The acquisition module 210 acquires a competition pair true value, and extracts and cleans patent data from the patent database according to the competition pair true value to construct a patent data set;
the construction module 220 determines patent semantic similarity according to the patent data set, constructs patent semantic edges according to the patent semantic similarity, and constructs a patent heterogeneous information network according to the patent semantic edges;
the embedding module 230 obtains the company nodes in the patent heterogeneous information network and the structural features of the company nodes through a graph embedding method;
the optimization module 240 screens the company nodes and fuses the vector representations of the company nodes through the attention mechanism to obtain an embedding matrix of the company nodes;
the competitor mining module 250 calculates the cosine similarity between the company nodes according to the embedding matrix, wherein the results with the highest cosine similarity are taken as candidate competitors of the target company.
With respect to the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be repeated here.
FIG. 3 illustrates a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 3, the apparatus 300 includes a computing unit 301 that may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 may also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other by a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 301 performs the respective methods and processes described above, such as the method for mining competitors. For example, in some embodiments, the method for mining competitors may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into RAM 303 and executed by computing unit 301, one or more steps of the method for mining competitors described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the method for mining competitors in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), the Internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method for mining competitors based on a patent heterogeneous information network, comprising:
acquiring a competition pair true value, and extracting and cleaning patent data from a patent database according to the competition pair true value to construct a patent data set;
determining patent semantic similarity according to the patent data set, constructing patent semantic edges according to the patent semantic similarity, and constructing a patent heterogeneous information network according to the patent semantic edges;
obtaining company nodes in the patent heterogeneous information network and the structural features of the company nodes by a graph embedding method;
screening the company nodes, and synthesizing vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes; and
calculating cosine similarity among the company nodes according to the embedding matrix, wherein the result with the highest cosine similarity is taken as a candidate competitor of a target company.
2. The method of claim 1, wherein the extracting and cleaning patent data from the patent database to construct a patent data set comprises:
obtaining the competition pair true value by using a web crawler, de-duplicating the patent data, and filtering out null values and invalid data.
3. The method of claim 1, wherein the determining patent semantic similarity according to the patent data set and constructing patent semantic edges according to the patent semantic similarity comprises:
for a document content set $T = \{tc_1, tc_2, \ldots, tc_{|D|}\}$, computing the weight of a word $w_i$ in document content $tc$ based on the term frequency-inverse document frequency (TF-IDF) index, according to the following formula:
Figure FDA0003940947310000011
wherein $tc_i$ is the text content of patent $p_i$,
Figure FDA0003940947310000012
$|D|$ is the total number of documents, $m_w$ is the number of documents in which the word $w$ appears at least once, and $tf(w, tc_i)$ is the number of times the word $w$ appears in the document content $tc_i$;
acquiring a representation of the document content $tc_i$ according to the weight, by the following formula:
Figure FDA0003940947310000013
wherein
Figure FDA0003940947310000014
is the N-dimensional word vector of word $w_i$;
calculating the similarity of the text content of patents $tc_i$ and $tc_j$ in $T$, the patent semantic similarity ranking being calculated according to the following formulas:
Figure FDA0003940947310000021
$p_i p = \mathrm{Topn}(Sim_1(p_i, p_j))$,
wherein $tsim(\cdot,\cdot)$ is a function for calculating similarity, $\psi(p_i, p_j)$ is a binary indicator, $\lambda$ is a parameter relating to patents $tc_i$ and $tc_j$, and $\mathrm{Topn}(\cdot)$ is used to obtain the list of the $n$ patents with the highest similarity;
constructing the patent semantic edges $PL = \{p_1 p, p_2 p, \ldots, p_{|D|} p\}$ according to the similarity ranking.
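The steps of claim 3 can be illustrated with a minimal sketch. The formula images are not reproduced in this text, so the specifics here are assumptions: the weight is taken as tf(w, tc_i) × log(|D| / m_w), the document representation as a weight-scaled sum of word vectors, and tsim as cosine similarity. The function names and the toy word vectors are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # m_w: number of documents in which word w appears at least once
    doc_freq = Counter(w for doc in docs for w in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(w, tc_i): raw count of w in this document
        # Assumed weighting: tf * log(|D| / m_w)
        weights.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return weights


def doc_vector(weight, word_vecs, dim):
    """Assumed aggregation: weighted sum of the words' N-dimensional vectors."""
    vec = [0.0] * dim
    for w, wt in weight.items():
        for k, x in enumerate(word_vecs.get(w, [0.0] * dim)):
            vec[k] += wt * x
    return vec


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top_n_similar(vectors, i, n):
    """Indices of the n documents most similar to document i, best first."""
    scored = [(cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scored[:n]]


# Toy data (hypothetical): three tokenized patents and 2-dimensional word vectors.
docs = [["graph", "embedding"], ["graph", "network"], ["battery", "cell"]]
word_vecs = {
    "graph": [1.0, 0.0], "embedding": [0.9, 0.1], "network": [0.8, 0.2],
    "battery": [0.0, 1.0], "cell": [0.1, 0.9],
}
vectors = [doc_vector(w, word_vecs, 2) for w in tfidf_weights(docs)]
semantic_links = {i: top_n_similar(vectors, i, 1) for i in range(len(docs))}
```

Each entry of `semantic_links` plays the role of one ranked list $p_i p$; the ordering inside the list is what a rank-dependent edge weight would consume.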
4. The method according to claim 3, wherein the constructing a patent heterogeneous information network according to the patent semantic edges comprises:
for a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, where $V$, $E$, and $H$ represent the node set, edge set, and weight set respectively, $\zeta$ is an object type mapping function, $\psi$ is a relation type mapping function, and $\mu$ is a weight type mapping function, the weights of the relationships between the nodes are defined as follows:
Figure FDA0003940947310000022
where $u$ is the weight attenuation value and $rank_{index}$ is the index of the patent semantic similarity rank; if the relation between the nodes is a patent semantic edge, the edge weight of the relation is $g - u \times rank_{index}$.
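The rank-dependent weight for semantic edges stated in claim 4 can be sketched as follows. Only the expression g − u × rank_index comes from the claim; the default values of `g` and `u`, the 0-based rank index, and the function names are assumptions made for illustration, and the weights of the other relation types (given in the formula image) are not modeled.

```python
def semantic_edge_weight(g, u, rank_index):
    """Weight of a patent semantic edge: g - u * rank_index."""
    return g - u * rank_index


def semantic_edges(patent, ranked_neighbors, g=1.0, u=0.1):
    """Build (source, target, weight) triples from a similarity-ranked list.

    ranked_neighbors is ordered from most to least similar; the rank index
    is assumed to start at 0, so the most similar patent gets weight g.
    """
    return [
        (patent, neighbor, semantic_edge_weight(g, u, rank))
        for rank, neighbor in enumerate(ranked_neighbors)
    ]


# Hypothetical patent identifiers.
edges = semantic_edges("p1", ["p7", "p3", "p9"])
```

With these assumed values the weight decays linearly with rank, so less similar patents contribute weaker edges to the network.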
5. The method according to claim 1, wherein the obtaining, by a graph embedding method, the company nodes in the patent heterogeneous information network and the structural features of the company nodes comprises:
selecting a meta-path $P: A_1 \rightarrow A_2 \rightarrow \ldots \rightarrow A_{l+1}$ in a given patent heterogeneous information network $G' = (V, E, H, \zeta, \psi, \mu)$, wherein the meta-path walk follows the following distribution:
Figure FDA0003940947310000023
wherein $n_t$ is the $t$-th node traversed in the random walk, the type of $v$ is $A_t$, and
Figure FDA0003940947310000024
is the set of first-order neighbors of node $v$ that are of type $A_{t+1}$.
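A meta-path-guided random walk of the kind described in claim 5 can be sketched as below. The transition distribution itself is in a formula image that is not reproduced here, so this sketch assumes the common choice of picking uniformly among the first-order neighbors whose type matches the next type on the meta-path. The graph layout, type labels, and function name are hypothetical.

```python
import random


def metapath_walk(graph, node_type, start, meta_path, walk_length, rng=None):
    """Walk the graph so that visited node types follow meta_path cyclically.

    graph: {node: [neighbor, ...]}; node_type: {node: type label};
    meta_path: list of type labels whose first and last entries are equal,
    e.g. ["C", "P", "C"] for company -> patent -> company.
    """
    rng = rng or random.Random()
    if node_type[start] != meta_path[0]:
        raise ValueError("start node type must match the first meta-path type")
    walk = [start]
    pos = 0
    while len(walk) < walk_length:
        next_pos = pos + 1
        if next_pos == len(meta_path):
            next_pos = 1  # the last type equals the first, so continue from index 1
        wanted = meta_path[next_pos]
        # First-order neighbors of the current node that have the wanted type
        candidates = [n for n in graph.get(walk[-1], []) if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))  # assumed uniform transition
        pos = next_pos
    return walk


# Toy network (hypothetical): two companies, two patents, one semantic edge p1-p2.
graph = {
    "c1": ["p1"], "c2": ["p1", "p2"],
    "p1": ["c1", "c2", "p2"], "p2": ["c2", "p1"],
}
node_type = {"c1": "C", "c2": "C", "p1": "P", "p2": "P"}
walk = metapath_walk(graph, node_type, "c1", ["C", "P", "C"], 5, random.Random(0))
```

Walks produced this way can then be fed to a skip-gram style model so that company nodes sharing contexts end up with similar embeddings.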
6. The method of claim 1, wherein the structural feature of the company node satisfies a co-occurrence probability maximization formula, expressed as:
Figure FDA0003940947310000025
wherein $v_i$ is the current node and $v_j$ is a context node.
7. The method of claim 1, wherein the screening the company nodes and synthesizing vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes comprises:
transforming the node embedding by a nonlinear transformation and, based on the attention vector
Figure FDA0003940947310000031
obtaining the attention value of node $i$ in meta-path $Z$, formulated as follows:
$\omega_i = q^{T} \cdot \tanh(W \cdot (e_i)^{T} + b)$,
wherein $W$ is a weight matrix, $b$ is a bias value, and $e_i$ is an embedded representation;
normalizing the attention value $\omega_i$ corresponding to node $i$ in the meta-path $Z$ using an activation function, by the following formula:
Figure FDA0003940947310000032
wherein $\alpha_i$ is the embedding importance value of node $i$;
for all nodes in meta-path $Z$, the weight matrix is
Figure FDA0003940947310000033
for the meta-paths $Z_C$ and $Z_{C'}$ of the company nodes obtained by screening through the graph embedding method, and the meta-path $Z_G$ of the company nodes obtained by screening through random walks over the patent heterogeneous information network, the embedding matrix of the company nodes is as follows:
$Z_A = \alpha_C \cdot Z_C + \alpha_{C'} \cdot Z_{C'} + \alpha_G \cdot Z_G$,
wherein $\alpha_C$ is the learned weight matrix of all nodes for meta-path $Z_C$, $\alpha_{C'}$ is the learned weight matrix of all nodes for meta-path $Z_{C'}$, and $\alpha_G$ is the learned weight matrix of all nodes for meta-path $Z_G$.
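The attention-based fusion in claim 7 can be sketched in plain Python. The attention score follows the stated form q^T · tanh(W · e^T + b); the normalization formula is in an image that is not reproduced here, so a softmax over the three per-node scores is assumed, and the fused row is the attention-weighted sum of the three embeddings. Treating `b` as a vector, the parameter shapes, and the function names are assumptions for illustration.

```python
import math


def attention_score(q, W, b, e):
    """omega = q^T . tanh(W . e^T + b) for one node embedding e."""
    hidden = [
        math.tanh(sum(w_rk * e_k for w_rk, e_k in zip(row, e)) + b_r)
        for row, b_r in zip(W, b)
    ]
    return sum(q_r * h_r for q_r, h_r in zip(q, hidden))


def softmax(xs):
    """Numerically stable softmax (assumed normalization)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def fuse_views(views, q, W, b):
    """Fuse per-meta-path embedding matrices into one embedding matrix.

    views: list of matrices (e.g. Z_C, Z_C', Z_G), each a list of per-node
    vectors with identical shapes. Returns one fused vector per node.
    """
    fused = []
    for i in range(len(views[0])):
        # One attention value per view for node i, normalized across views
        alphas = softmax([attention_score(q, W, b, view[i]) for view in views])
        dim = len(views[0][i])
        fused.append(
            [sum(a * view[i][k] for a, view in zip(alphas, views)) for k in range(dim)]
        )
    return fused


# Toy example (hypothetical values): one node, three 2-dimensional views.
Z_C, Z_C2, Z_G = [[3.0, 0.0]], [[0.0, 3.0]], [[0.0, 0.0]]
q, W, b = [1.0], [[0.0, 0.0]], [0.0]  # zero W gives equal attention to every view
Z_A = fuse_views([Z_C, Z_C2, Z_G], q, W, b)
```

In practice `q`, `W`, and `b` would be learned jointly with the embeddings; the zero-valued parameters above only make the equal-weight case easy to check by hand.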
8. The method of claim 7, wherein, in the obtaining of the embedding matrix of the company nodes, a loss function is:
Figure FDA0003940947310000034
wherein $E^{+}$ and $E$ are the total node pairs, and $t$ is a hyperparameter.
9. The method of claim 1, wherein the cosine similarity between the company nodes is calculated from the embedding matrix by the formula:
Figure FDA0003940947310000035
wherein $c_i$ and $c_j$ are company nodes in the embedding matrix.
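The final ranking step can be sketched with the standard cosine similarity, cos(c_i, c_j) = (c_i · c_j) / (‖c_i‖ ‖c_j‖). The company names, the embedding values, and the `top_k` parameter are hypothetical; the claims only state that the result with the highest cosine similarity is taken as the candidate competitor.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_competitors(embeddings, target, top_k=1):
    """Return the top_k companies most similar to target, best first.

    embeddings: {company name: embedding vector}, i.e. rows of the
    company embedding matrix keyed by company.
    """
    scores = [
        (cosine_similarity(embeddings[target], vec), name)
        for name, vec in embeddings.items()
        if name != target
    ]
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return [name for _, name in scores[:top_k]]


# Hypothetical company embeddings.
embeddings = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
candidates = rank_competitors(embeddings, "A", top_k=1)
```

Returning a short ranked list rather than a single name makes it easy to inspect near-ties before settling on one candidate.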
10. An apparatus for mining competitors based on a patent heterogeneous information network, comprising:
an acquisition module configured to acquire a competition pair true value, and to extract and clean patent data from a patent database according to the competition pair true value to construct a patent data set;
a construction module configured to determine patent semantic similarity according to the patent data set, construct patent semantic edges according to the patent semantic similarity, and construct a patent heterogeneous information network according to the patent semantic edges;
an embedding module configured to obtain company nodes in the patent heterogeneous information network and the structural features of the company nodes by a graph embedding method;
an optimization module configured to screen the company nodes and synthesize vector representations of the company nodes through an attention mechanism to obtain an embedding matrix of the company nodes; and
a competitor mining module configured to calculate cosine similarity among the company nodes according to the embedding matrix, wherein the result with the highest cosine similarity is taken as a candidate competitor of a target company.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211421320.5A CN115641009B (en) 2022-11-14 2022-11-14 Method and device for excavating competitors based on patent heterogeneous information network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211421320.5A CN115641009B (en) 2022-11-14 2022-11-14 Method and device for excavating competitors based on patent heterogeneous information network

Publications (2)

Publication Number Publication Date
CN115641009A CN115641009A (en) 2023-01-24
CN115641009B true CN115641009B (en) 2023-05-05

Family

ID=84948116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211421320.5A Active CN115641009B (en) 2022-11-14 2022-11-14 Method and device for excavating competitors based on patent heterogeneous information network

Country Status (1)

Country Link
CN (1) CN115641009B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112735B (en) * 2023-10-19 2024-02-13 中汽信息科技(天津)有限公司 Patent database construction method and electronic equipment
CN117807275A (en) * 2023-12-29 2024-04-02 江南大学 Heterogeneous graph embedding method and system based on relation mining

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175224A (en) * 2019-06-03 2019-08-27 安徽大学 Patent recommendation method and device based on semantic-interlink heterogeneous information network embedding
CN111831913A (en) * 2020-07-17 2020-10-27 深圳龙图腾创新设计有限公司 Potential competitor information recommendation method, device, equipment and storage medium
CN112182183A (en) * 2020-09-28 2021-01-05 厦门理工学院 Patent harmful effect knowledge mining method, device, equipment and storage medium
CN112182424A (en) * 2020-11-11 2021-01-05 重庆邮电大学 Social recommendation method based on integration of heterogeneous information and isomorphic information networks
CN113190754A (en) * 2021-05-11 2021-07-30 四川大学 Recommendation method based on heterogeneous information network representation learning
CN113779264A (en) * 2021-08-29 2021-12-10 北京工业大学 Trade recommendation method based on patent supply and demand knowledge graph
CN113836398A (en) * 2021-08-29 2021-12-24 北京工业大学 Patent transaction recommendation method based on attribute heterogeneous network representation learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040015481A1 (en) * 2002-05-23 2004-01-22 Kenneth Zinda Patent data mining

Also Published As

Publication number Publication date
CN115641009A (en) 2023-01-24

Similar Documents

Publication Publication Date Title
CN115641009B (en) Method and device for excavating competitors based on patent heterogeneous information network
TWI729472B (en) Method, device and server for determining feature words
CN111753914B (en) Model optimization method and device, electronic equipment and storage medium
CN113590645B (en) Searching method, searching device, electronic equipment and storage medium
US20210312139A1 (en) Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium
CN111522967B (en) Knowledge graph construction method, device, equipment and storage medium
CN104899322A (en) Search engine and implementation method thereof
CN110532352B (en) Text duplication checking method and device, computer readable storage medium and electronic equipment
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
US20140365494A1 (en) Search term clustering
CN116028618B (en) Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN107527289B (en) Investment portfolio industry configuration method, device, server and storage medium
CN110334343A (en) The method and system that individual privacy information extracts in a kind of contract
CN116401345A (en) Intelligent question-answering method, device, storage medium and equipment
CN107133274B (en) Distributed information retrieval set selection method based on graph knowledge base
CN113033194B (en) Training method, device, equipment and storage medium for semantic representation graph model
CN112560425A (en) Template generation method and device, electronic equipment and storage medium
CN114491232B (en) Information query method and device, electronic equipment and storage medium
CN112860626B (en) Document ordering method and device and electronic equipment
CN112507181B (en) Search request classification method, device, electronic equipment and storage medium
CN112989190A (en) Commodity mounting method and device, electronic equipment and storage medium
CN112528644A (en) Entity mounting method, device, equipment and storage medium
CN111639599A (en) Object image mining method, device, equipment and storage medium
CN114422584B (en) Method, device and storage medium for pushing resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant