CN111680252A - External link identification method, apparatus, device and computer-readable storage medium - Google Patents


Info

Publication number
CN111680252A
CN111680252A (application CN202010511107.8A)
Authority
CN
China
Prior art keywords
external link
point
chain
weight
publishing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010511107.8A
Other languages
Chinese (zh)
Other versions
CN111680252B (en)
Inventor
康战辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010511107.8A
Publication of CN111680252A
Application granted
Publication of CN111680252B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application disclose an external link identification method, apparatus, device, and computer-readable storage medium. The method includes: generating an adjacency graph between publishing points and external links according to the content data containing external links that content publishing points published during a first time period, where each publishing point in the adjacency graph is connected by an edge to the external links it published, a first external link and a second external link in the adjacency graph are connected by an edge, and the first and second external links are two different external links published in different publishing units of the same publishing point; computing, according to the number of publishing units containing each external link, a first weight for the edge between a publishing point and an external link it published, and a second weight for the edge between the first external link and the second external link; generating the external link features of the content publishing point based on the adjacency graph, the first weight, and the second weight; and inputting the external link features into a trained external link cheating identification model to obtain an identification result.

Description

External link identification method, apparatus, device and computer-readable storage medium
Technical Field
This application relates to the field of external link identification technologies, and in particular to an external link identification method, apparatus, device, and computer-readable storage medium.
Background
With the rapid development of the mobile internet era, users have shifted much of their time to mobile terminals. From the portal sites of the PC era to the rise of all kinds of self-media in the mobile era, the channels and ways through which users obtain content keep multiplying: blogs, microblogs, forum posts, and official account systems based on various instant messaging tools.
An official account is an application account that a developer or merchant registers on a public platform. The account interoperates with instant messaging accounts, and through it a merchant can communicate and interact with a targeted group on the platform via text, pictures, voice, and video. As a subscription-based personal self-media platform, the official account platform currently attracts wide attention in the industry, with hundreds of thousands of users looking for the official account articles they want through the WeChat search portal every day. Precisely because of this huge traffic, and much as in the web page era, some profit-driven official account owners engage in cheating to capture illegitimate click traffic, stuffing titles with hot topics or mixing into the article body cheating words that do not match the article's theme.
Because the official account system now counters such hot-topic stuffing and text cheating in official account articles, some official account owners instead add cheating external links to the "read the original article" page. When a user taps "read full text", the external link induces a jump to an external cheating website, which may then carry out activities that harm users' physical, mental, or financial well-being, including pornography, gambling, and induced shopping.
How to govern the link cheating forms present in official accounts, and how to effectively identify whether an external link is a cheating link, is a hot research problem for those skilled in the art.
Disclosure of Invention
The embodiments of this application provide an external link identification method, a related apparatus, a device, and a computer-readable storage medium, which can effectively identify whether an external link is a cheating link.
In a first aspect, an embodiment of this application provides an external link identification method, where the method includes:
generating an adjacency graph between publishing points and external links according to the content data containing external links that content publishing points published during a first time period, where a publishing point in the adjacency graph is connected by an edge to each external link it published, a first external link and a second external link in the adjacency graph are connected by an edge, and the first and second external links are two different external links published in different publishing units of the same publishing point;
computing, according to the number of publishing units containing each external link, a first weight for the edge between a publishing point and an external link it published, and a second weight for the edge between the first external link and the second external link; and
inputting the adjacency graph, the first weight, and the second weight into a trained external link cheating identification model to obtain an identification result, where the identification result indicates whether the content publishing point has published a cheating external link.
By implementing the embodiments of this application, an adjacency graph between publishing points and external links is generated; the external link features of a publishing point are generated from the graph structure of the adjacency graph, the first weights of the edges between the publishing point and the external links it published, and the second weights of the edges between first and second external links; and the external link features are input into a trained external link cheating identification model to obtain an identification result. This improves the accuracy of external link identification. Furthermore, because the external link cheating identification model is trained and applied on the graph structure and related features of the adjacency graph between publishing points and external links, the electronic device runs more efficiently and occupies fewer computing resources than in the prior art, thereby improving computer performance.
In one possible implementation, the first weight of the edge between a publishing point and an external link it published is the number of publishing units at the publishing point that contain the external link.
In one possible implementation, computing the second weight of the edge between the first external link and the second external link according to the number of publishing units containing each external link includes:
computing the second weight according to the number of publishing units at each publishing point in a publishing point set that contain the first external link and the second external link respectively, where the publishing point set is the set of publishing points that published content data containing the first external link and the second external link.
In one possible implementation, the second weight of the edge between the first external link and the second external link is computed by the following formula:

$$W(\mathrm{Url}_i,\mathrm{Url}_j)=\frac{\sum_{u\in\mathrm{ComUin}(i,j)}\mathrm{DocCnt}_{u,i}\cdot\mathrm{DocCnt}_{u,j}}{\lVert\mathrm{DocCnt}_{\cdot,i}\rVert\cdot\lVert\mathrm{DocCnt}_{\cdot,j}\rVert},\qquad \lVert\mathrm{DocCnt}_{\cdot,i}\rVert=\sqrt{\sum_{u=1}^{N}\mathrm{DocCnt}_{u,i}^{2}}$$

where Url_i is the first external link; Url_j is the second external link; ComUin(i, j) is the publishing point set; DocCnt_{u,i} is the number of publishing units published by publishing point u that contain Url_i; DocCnt_{u,j} is the number of publishing units published by publishing point u that contain Url_j; DocCnt_{.,i} is the vector of the numbers of publishing units containing Url_i published by each publishing point in the publishing point set; DocCnt_{.,j} is the corresponding vector for Url_j; and N is the number of publishing points in the publishing point set.
In one possible implementation, the external link cheating identification model includes a model trained based on a network embedding algorithm.
In one possible implementation, the network embedding algorithm includes the deep walk (DeepWalk) algorithm, and inputting the adjacency graph, the first weight, and the second weight into the trained external link cheating identification model includes:
generating random walk sequences based on the adjacency graph, the first weight, and the second weight; and
treating the nodes of the adjacency graph as words, learning the feature vectors of the nodes in the random walk sequences with the word2vec algorithm, and outputting a feature vector matrix. Word2vec is a group of related models used to produce word vectors; these models are shallow two-layer neural networks trained to reconstruct the linguistic contexts of words.
In one possible implementation, computing, according to the number of publishing units containing each external link, the first weight of the edge between a publishing point and an external link it published, and the second weight of the edge between the first external link and the second external link, includes:
computing the first weight and the second weight based on the number of publishing units containing external links together with at least one of the following factors: the authority value of the content publishing point; the publication time of the content data; or the frequency at which the content publishing point publishes content.
In a second aspect, an embodiment of this application provides an external link identification apparatus, including:
an adjacency graph generating unit, configured to generate an adjacency graph between publishing points and external links according to the content data containing external links that content publishing points published during a first time period, where a publishing point in the adjacency graph is connected by an edge to each external link it published, a first external link and a second external link in the adjacency graph are connected by an edge, and the first and second external links are two different external links published in different publishing units of the same publishing point;
a calculating unit, configured to compute, according to the number of publishing units containing each external link, a first weight for the edge between a publishing point and an external link it published, and a second weight for the edge between the first external link and the second external link;
a feature generating unit, configured to generate the external link features of the content publishing point based on the adjacency graph, the first weight, and the second weight; and
an identifying unit, configured to input the external link features into a trained external link cheating identification model to obtain an identification result, where the identification result indicates whether the content publishing point has published a cheating external link.
In one possible implementation, the first weight of the edge between a publishing point and an external link it published is the number of publishing units at the publishing point that contain the external link.
In one possible implementation, the calculating unit is specifically configured to: compute the second weight of the edge between the first external link and the second external link according to the number of publishing units at each publishing point in a publishing point set that contain the first external link and the second external link respectively, where the publishing point set is the set of publishing points that published content data containing the first external link and the second external link.
In one possible implementation, the second weight of the edge between the first external link and the second external link is calculated by the following formula:

$$W(\mathrm{Url}_i,\mathrm{Url}_j)=\frac{\sum_{u\in\mathrm{ComUin}(i,j)}\mathrm{DocCnt}_{u,i}\cdot\mathrm{DocCnt}_{u,j}}{\lVert\mathrm{DocCnt}_{\cdot,i}\rVert\cdot\lVert\mathrm{DocCnt}_{\cdot,j}\rVert},\qquad \lVert\mathrm{DocCnt}_{\cdot,i}\rVert=\sqrt{\sum_{u=1}^{N}\mathrm{DocCnt}_{u,i}^{2}}$$

where Url_i is the first external link; Url_j is the second external link; ComUin(i, j) is the publishing point set; DocCnt_{u,i} is the number of publishing units published by publishing point u that contain Url_i; DocCnt_{u,j} is the number of publishing units published by publishing point u that contain Url_j; DocCnt_{.,i} is the vector of the numbers of publishing units containing Url_i published by each publishing point in the publishing point set; DocCnt_{.,j} is the corresponding vector for Url_j; and N is the number of publishing points in the publishing point set.
In one possible implementation, the feature generating unit is specifically configured to: generate the external link features of the content publishing point through a network embedding algorithm based on the adjacency graph, the first weight, and the second weight.
In one possible implementation, the network embedding algorithm includes the deep walk (DeepWalk) algorithm, and the feature generating unit may be specifically configured to:
generate random walk sequences based on the adjacency graph, the first weight, and the second weight; and
treat the nodes of the adjacency graph as words, learn the feature vectors of the nodes in the random walk sequences with the word2vec algorithm, and output a feature vector matrix.
In one possible implementation, the calculating unit is specifically configured to: compute the first weight of the edge between a publishing point and an external link it published, and the second weight of the edge between the first external link and the second external link, based on the number of publishing units containing external links together with at least one of the following factors: the authority value of the content publishing point; the publication time of the content data; or the frequency at which the content publishing point publishes content.
In a third aspect, an embodiment of this application provides an external link identification device, including a processor, where the processor is configured to invoke stored program instructions to perform the method of the first aspect and each of its possible implementations.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect and each of its possible implementations.
In a fifth aspect, an embodiment of this application further provides a computer program, where the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect and each of its possible implementations.
It should be understood that the second to fifth aspects of this application are consistent with the technical solution of the first aspect; the beneficial effects obtained by these aspects and their corresponding possible implementations are similar and are not repeated here.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
Fig. 1 is a schematic diagram of an architecture of an external link identification system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an external link identification method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an adjacency graph provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an original graph provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of the principle of the DeepWalk algorithm according to an embodiment of this application;
FIG. 6 is a schematic diagram of the principle of random walk according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of an external link identification apparatus according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of an external link identification device according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In order to better understand the outer chain identification method provided by the embodiment of the application, some concepts related to the application are explained first.
The content publishing point (publishing point for short) in the embodiments of this application may specifically include an ordinary website, an official account, a weblog (e.g., a blog), and the like. A publishing unit is an element of a publishing point. For example, a website may be divided into content sections by content type, such as a current news section, a sports section, and an entertainment section; each article published in a content section is a publishing unit. Likewise, an article published by an official account is a publishing unit.
Specifically, the external link identification method of the embodiments of this application may be applied to the following scenario A and scenario B, although it is not limited to these two application scenarios. Scenario A and scenario B are briefly described below.
Scenario A:
A user opens an article on a website, for example by entering the current news section of the website and, on the news browsing page, reading news articles on the current page. For an article on the website that contains external links, the external link identification method of the embodiments of this application can identify whether the website has published a cheating external link.
Scenario B:
A user opens an official account article, for example by entering official account A and tapping to browse an article published on its page. For an article on official account A that contains external links, the external link identification method of the embodiments of this application can identify whether official account A has published a cheating external link.
Since the embodiments of this application involve the application of neural networks, for ease of understanding, related terms and concepts involved in the embodiments of this application, such as neural networks, are described below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s and an intercept of 1 as inputs, and the output of the operation unit may be:

$$h_{W,b}(x)=f\big(W^{T}x\big)=f\Big(\sum_{s=1}^{n}W_{s}x_{s}+b\Big)$$

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
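For illustration only, the following minimal Python sketch evaluates the neural unit formula above; the sigmoid activation and all input values are assumptions, not part of the claimed method:

```python
import numpy as np

def sigmoid(z):
    # One possible activation function f.
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b).
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs x_s
W = np.array([0.1, 0.4, -0.2])   # hypothetical weights W_s
b = 0.3                          # bias
print(neural_unit(x, W, b))      # a single scalar output signal
```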
(2) Deep neural network
Deep Neural Networks (DNNs), also known as multi-layer neural networks, can be understood as neural networks with many hidden layers; "many" has no particular threshold. Dividing a DNN by the position of its layers, the neural network inside a DNN falls into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is connected to every neuron of the (i+1)-th layer. Although a DNN looks complex, the work of each layer is not complex; it is simply the following linear relational expression:

$$\vec{y}=\alpha\big(W\vec{x}+\vec{b}\big)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, it also has many coefficients W and bias vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example: in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 is the layer number of the coefficient W, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as $W^{L}_{jk}$. Note that the input layer has no W parameters. In a deep neural network, more hidden layers allow the network to better characterize complex real-world situations. In principle, a model with more parameters has higher complexity and greater "capacity", meaning it can handle more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the matrices formed by the vectors W of many layers).
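As an illustrative sketch only (the layer sizes, random weights, and tanh activation are assumptions), the following Python code stacks the per-layer expression above and shows where a coefficient such as $W^{2}_{24}$ lives:

```python
import numpy as np

def forward(x, layers, alpha=np.tanh):
    # Apply y = alpha(W x + b) layer by layer.
    a = x
    for W, b in layers:
        a = alpha(W @ a + b)
    return a

rng = np.random.default_rng(0)
# A made-up network: 4 input neurons -> 5 hidden neurons -> 3 outputs.
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)   # input -> hidden
W3, b3 = rng.normal(size=(3, 5)), rng.normal(size=3)   # hidden -> output
# W2[1, 3] is the coefficient from the 4th neuron of the first layer to
# the 2nd neuron of the second layer, i.e. W^2_{24} in the notation above.
print(forward(rng.normal(size=4), [(W2, b2), (W3, b3)]))
```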
(3) Convolutional neural network
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers. The feature extractor may be viewed as a filter, and the convolution process may be viewed as convolving a trainable filter with an input image or a convolved feature plane (feature map). A convolutional layer is a layer of neurons in the convolutional neural network that performs convolution on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons of the adjacent layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of rectangularly arranged neural units. The neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as extracting image information in a location-independent way. The underlying principle is that the statistics of one part of an image are the same as those of other parts, so image information learned in one part can also be used in another, and the same learned image information can be used at every position on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
A convolution kernel can be initialized as a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, the direct benefit of weight sharing is reducing the connections between layers of the convolutional neural network, while also reducing the risk of overfitting.
(4) Recurrent Neural Networks (RNNs) are used to process sequence data. In the traditional neural network model, the layers, from the input layer through the hidden layer to the output layer, are fully connected, while the nodes within each layer are unconnected. Although such ordinary neural networks solve many problems, they remain powerless for many others. For example, to predict the next word of a sentence you generally need the previous words, because the words of a sentence are not independent. An RNN is called recurrent because the current output of a sequence also depends on the previous outputs. Concretely, the network memorizes the previous information and applies it to the computation of the current output; that is, the nodes of the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a conventional CNN or DNN: the error back-propagation algorithm is also used, with one difference: if the RNN is unrolled, its parameters, such as W, are shared, which is not the case for the traditional neural networks exemplified above. Moreover, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the network states of the previous steps. This learning algorithm is called Back Propagation Through Time (BPTT).
Now that there are convolutional neural networks, why are recurrent neural networks needed? The reason is simple: a convolutional neural network presupposes that the elements are independent of one another, as are its inputs and outputs, like cats and dogs. But in the real world many elements are interconnected, such as stock prices changing over time; or, for example, a person says: "I like traveling; my favorite place is Yunnan; if I have the chance in the future I will definitely go to ____." To fill in the blank here, humans all know to fill in "Yunnan", because humans infer from the context. But how would a machine do this? The RNN arose for this purpose: it aims to give machines the ability to remember like humans. Accordingly, the output of an RNN needs to depend on the current input information together with historical memory information.
(5) Loss function
In training a deep neural network, because we want the output of the network to be as close as possible to the value we truly want to predict, we can compare the current predicted value of the network with the truly desired target value and then update the weight vectors of each layer according to the difference between them (of course, there is usually an initialization process before the first update, in which parameters are preconfigured for each layer of the deep neural network). For example, if the network's predicted value is too high, the weight vectors are adjusted to predict lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the role of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes the process of reducing this loss as much as possible.
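For illustration, a minimal sketch of one common loss function (mean squared error); the prediction and target values below are invented:

```python
import numpy as np

def mse_loss(pred, target):
    # The higher the loss, the larger the gap between prediction and target.
    return np.mean((pred - target) ** 2)

pred = np.array([2.5, 0.1])      # hypothetical network predictions
target = np.array([3.0, 0.0])    # truly desired target values
print(mse_loss(pred, target))    # training drives this value down
```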
(6) Back propagation algorithm
During training, the convolutional neural network may use the Back Propagation (BP) algorithm to correct the parameter values of the initial sample generator, so that the reconstruction error loss of the initial sample generator becomes smaller and smaller. Specifically, the error loss arises when the input signal is propagated forward to the output, and the parameters of the initial sample generator are updated by back-propagating the error loss information so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the target sample generator, such as the weight matrices.
(7) Generative adversarial network
Generative Adversarial Networks (GANs) are a deep learning model. The model includes at least two modules: one is a generative model (also referred to in the embodiments of this application as a generative network), and the other is a discriminative model (also referred to in the embodiments of this application as a discriminative network); the two modules learn through a game with each other, thereby producing better output. Both the generative model and the discriminative model may be neural networks, specifically deep neural networks or convolutional neural networks. The basic principle of a GAN is as follows, taking a GAN that generates pictures as an example. Suppose there are two networks, G (Generator) and D (Discriminator). G is a network that generates pictures: it receives random noise z and generates a picture from this noise, denoted G(z). D is a discriminative network that judges whether a picture is "real". Its input parameter is x, where x represents a picture, and the output D(x) represents the probability that x is a real picture: if it is 1, the picture is 100% real; if it is 0, the picture cannot be real. In the process of training this generative adversarial network, the goal of the generative network G is to generate pictures as real as possible to deceive the discriminative network D, while the goal of D is to distinguish the pictures generated by G from real pictures as well as possible. G and D thus form a dynamic "game" process, which is the "adversarial" part of a "generative adversarial network". As the final outcome of the game, in the ideal state G can generate pictures G(z) that look genuine, while D finds it difficult to judge whether they are real, i.e., D(G(z)) = 0.5. This yields an excellent generative model G, which can be used to generate pictures.
The system architecture provided by the embodiments of the present application is described below.
Referring to fig. 1, an embodiment of the present application provides a schematic diagram of an external chain recognition system architecture, where the system architecture for external chain recognition in the figure may include a model training process and an external chain recognition process. Wherein:
in the model training process:
and training the outer chain cheating recognition model by inputting sample data. The sample data comprises negative sample marking data of the public articles containing some cheating external chains and positive sample marking data of the public articles not containing the cheating external chains. And extracting outer chain features from the input sample data, and training to construct an outer chain cheating recognition model based on outer chain feature analysis.
The external link feature is extracted from the input sample data, and the external link feature corresponding to the sample public number can be generated based on a sample adjacency graph between a publishing point and an external link generated by sample content data containing the external link published by a content publishing point, as well as a first sample weight and a second sample weight.
The first outer chain and the second outer chain in the sample adjacency graph are connected through edges, and the first outer chain and the second outer chain are two different outer chains which are respectively issued in different issue units of the same issue point; and calculating a first sample weight of an edge between the publishing point and the self-publishing outer chain and calculating a second sample weight of the edge between the first outer chain and the second outer chain according to the number of publishing units containing the outer chain.
In the process of identifying the outer chain:
generating an adjacency graph between the publishing point and the external chain through content data which is published by the content publishing point in a first time period and contains the external chain; and generating an outer chain characteristic corresponding to the content publishing point according to the adjacency graph, the first weight and the second weight, and inputting the outer chain characteristic into the trained outer chain cheating recognition model to output a recognition result. The recognition result represents whether the content distribution point distributes the cheated outer chain.
The first outer chain and the second outer chain in the adjacency graph are connected through edges, and the first outer chain and the second outer chain are two different outer chains which are respectively issued in different issuing units of the same issuing point; and calculating a first weight of an edge between the publishing point and the external chain published by the publishing point according to the number of the publishing units containing the external chain, and calculating a second weight of an edge between the first external chain and the second external chain.
How external link identification is performed in this application is described below with reference to the flowchart of the external link identification method provided in an embodiment of this application and shown in Fig. 2. The entity executing the method may be an electronic device such as a server, and specifically the following steps may be performed:
Step S200: generating an adjacency graph between publishing points and external links according to the content data containing external links that content publishing points published during a first time period.
The content publishing point in the embodiments of this application includes, but is not limited to, a website, an official account, or a weblog (such as a blog), and the content data it publishes may include articles. The first time period may be a preset period, such as 3 months or 6 months, or may be adjusted adaptively according to the amount of collected content data; this application does not limit it.
Fig. 3 is a schematic structural diagram of an adjacency graph provided in an embodiment of this application. The adjacency graph includes at least one publishing point and the external links it published, and each publishing point is connected by an edge to the external links it published. If a first external link and a second external link exist among the external links, they can be connected by an edge. The first and second external links are two different external links published in different publishing units of the same publishing point. The adjacency graph may include multiple first external links and multiple second external links.
For example, a publishing unit in the embodiments of this application may be an article published by a website: the website may be divided into several sections by content type, such as a current news section, a sports section, and an entertainment section, and each content section publishes articles; the external links published by different publishing units may then be external links in different articles of the same or different content sections. A publishing unit may also be an article of an official account, in which case the external links published by different publishing units may be external links in different articles.
Taking an official account as the publishing point, in one possible implementation the adjacency graph may be generated as follows. From the large amount of official account-article-link data accumulated on the official account platform over the past half year, an original graph structure can first be formed, as shown in Fig. 4, a schematic structural diagram of an original graph provided in an embodiment of this application, which contains at least one publishing point, the articles published by the publishing point, and the external links contained in the articles. The article layer of this three-layer network structure can then be removed, which yields the official account-external link adjacency graph of this application.
In one possible implementation, before or after removing the article layer of the three-layer network structure, a filtering process may also be included to filter out low-frequency links, i.e., links whose publication frequency is lower than or equal to a preset threshold. If filtering is performed before removing the article layer, the original graph is filtered to remove its low-frequency links; if filtering is performed after removing the article layer, the official account-external link adjacency graph is filtered to remove its low-frequency links.
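The following Python sketch illustrates this construction under stated assumptions: the article layer is represented as flat (publishing point, publishing unit, external link) records, and the record values and frequency threshold are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical flattened official account - article - link records.
records = [
    ("account_A", "article_1", "http://x.example/1"),
    ("account_A", "article_2", "http://x.example/2"),
    ("account_B", "article_3", "http://x.example/1"),
    ("account_B", "article_3", "http://x.example/3"),
]
MIN_FREQ = 0  # preset threshold; links with frequency <= MIN_FREQ are dropped

link_freq = defaultdict(int)
for _, _, url in records:
    link_freq[url] += 1

# point -> url -> set of publishing units containing that url
units = defaultdict(lambda: defaultdict(set))
point_link_edges = set()
for point, unit, url in records:
    if link_freq[url] <= MIN_FREQ:
        continue  # filter out low-frequency links
    units[point][url].add(unit)
    point_link_edges.add((point, url))  # publishing point -- external link edge

link_link_edges = set()
for point, url_units in units.items():
    for u1, u2 in combinations(sorted(url_units), 2):
        # connect two different links published in *different* publishing
        # units of the same publishing point
        if any(a != b for a in url_units[u1] for b in url_units[u2]):
            link_link_edges.add((u1, u2))

print(point_link_edges)
print(link_link_edges)
```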
Step S202: computing, according to the number of publishing units containing each external link, a first weight for the edge between a publishing point and an external link it published, and a second weight for the edge between the first external link and the second external link.
In one possible implementation, the first weight of the edge between a publishing point and an external link it published is the number of publishing units at the publishing point that contain the external link.
In one possible implementation, computing the second weight of the edge between the first external link and the second external link according to the number of publishing units containing each external link may include:
computing the second weight according to the number of publishing units at each publishing point in a publishing point set that contain the first external link and the second external link respectively, where the publishing point set is the set of publishing points that published content data containing the first external link and the second external link.
In one possible implementation, step S202 may also be implemented based on the number of publishing units containing external links together with at least one of the following factors: the authority value of the content publishing point; the publication time of the content data; or the frequency at which the content publishing point publishes content; so as to compute the first weight of the edge between a publishing point and an external link it published, and the second weight of the edge between the first external link and the second external link.
Step S204: generating the external link features of the content publishing point based on the adjacency graph, the first weight, and the second weight.
Based on the adjacency graph, the first weight, and the second weight, the embodiments of this application may generate the external link features of a content publishing point through a network embedding algorithm, or through a Graph Convolutional Network (GCN).
In one possible implementation, the network embedding algorithm includes the deep walk (DeepWalk) algorithm. Generating the external link features of the content publishing point through the network embedding algorithm based on the adjacency graph, the first weight, and the second weight may include:
generating random walk sequences based on the adjacency graph, the first weight, and the second weight; and treating the nodes of the adjacency graph as words, learning the feature vectors of the nodes in the random walk sequences with the word2vec algorithm, and outputting a feature vector matrix, as sketched below.
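A minimal sketch of the "nodes as words" step, assuming walk sequences are already available and using gensim's Word2Vec as one common skip-gram implementation (the walks, dimensions, and library choice are illustrative, not mandated by this application):

```python
from gensim.models import Word2Vec

# Hypothetical random walk sequences over the adjacency graph; each node
# (publishing point or external link) is treated as a word.
walks = [
    ["account_A", "url_1", "url_2", "account_B"],
    ["url_2", "account_A", "url_1", "account_B"],
]

model = Word2Vec(sentences=walks, vector_size=64, window=5,
                 sg=1, min_count=0, epochs=5)  # sg=1 selects skip-gram
feature_matrix = model.wv.vectors          # the |V| x d feature vector matrix
vec_account_A = model.wv["account_A"]      # feature vector of one node
```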
In one possible implementation, the content data in step S200 may be content data published by multiple different content publishing points, or content data published by a single content publishing point. If the content data is published by multiple different content publishing points, step S204 may generate the external link features of each of the multiple content publishing points. If the content data is published by one content publishing point, step S204 generates the external link features of that content publishing point.
Step S206: inputting the external link features into a trained external link cheating identification model to obtain an identification result, where the identification result indicates whether the content publishing point has published a cheating external link.
Some profit-driven official accounts often capture illegitimate click traffic by stuffing titles with hot topics or mixing into their articles cheating words that do not match the article's theme. Against such behavior, the article topic can be obtained by keyword extraction techniques over collected cheating dictionaries or by topic model (Topic Model) techniques; then, combining image-text features such as hits in the cheating dictionaries, the text length, the proportion of pictures in the article, and the similarity between the user's query string and the current article's keyword list, supervised machine learning cheating classification models can be applied (for example, collecting positive and negative cheating samples and building a binary classifier based on a tree model) to determine whether an official account article is a cheating article and where it should be ranked. However, the prior art has no effective solution for the link cheating forms present in official accounts.
Aiming at the link cheating forms present in WeChat official accounts, the embodiments of this application provide an external link identification method: the features of cheating external links are identified by generating an adjacency graph between publishing points and external links, and the external link features are input as cheating factors into an external link cheating identification model (such as a supervised cheating classification model) to identify the cheating forms. This effectively identifies whether an external link is a cheating link and achieves effective governance of the link cheating forms present in official accounts.
In one possible implementation, the embodiments of this application may compute the second weight of the edge between the first external link and the second external link by the following formula:

$$W(\mathrm{Url}_i,\mathrm{Url}_j)=\frac{\sum_{u\in\mathrm{ComUin}(i,j)}\mathrm{DocCnt}_{u,i}\cdot\mathrm{DocCnt}_{u,j}}{\lVert\mathrm{DocCnt}_{\cdot,i}\rVert\cdot\lVert\mathrm{DocCnt}_{\cdot,j}\rVert},\qquad \lVert\mathrm{DocCnt}_{\cdot,i}\rVert=\sqrt{\sum_{u=1}^{N}\mathrm{DocCnt}_{u,i}^{2}}$$

where Url_i is the first external link; Url_j is the second external link; ComUin(i, j) is the publishing point set; DocCnt_{u,i} is the number of publishing units published by publishing point u that contain Url_i; DocCnt_{u,j} is the number of publishing units published by publishing point u that contain Url_j; DocCnt_{.,i} is the vector of the numbers of publishing units containing Url_i published by each publishing point in the publishing point set; DocCnt_{.,j} is the corresponding vector for Url_j; and N is the number of publishing points in the publishing point set.

For example, suppose the number of publishing points in a publishing point set is 3. Taking official accounts as an example, namely three official accounts P1, P2, and P3: if the number of articles published under official account P1 that contain Url_i is 10, the number published under P2 that contain Url_i is 20, and the number published under P3 that contain Url_i is 30, then

$$\lVert\mathrm{DocCnt}_{\cdot,i}\rVert^{2}=10^{2}+20^{2}+30^{2}.$$

Similarly, if the number of articles published under official account P4 that contain Url_j is 15, the number published under P5 that contain Url_j is 18, and the number published under P6 that contain Url_j is 35, then

$$\lVert\mathrm{DocCnt}_{\cdot,j}\rVert^{2}=15^{2}+18^{2}+35^{2}.$$
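Reading the reconstructed formula above as a cosine similarity between the per-point count vectors, a minimal Python sketch follows; this reading and the example counts follow the worked example and should be treated as an assumption rather than the authoritative formula:

```python
import numpy as np

def second_weight(cnt_i, cnt_j):
    # cnt_i[u] / cnt_j[u]: number of publishing units of publishing point u
    # containing Url_i / Url_j, over the N points of the publishing point set.
    cnt_i = np.asarray(cnt_i, dtype=float)
    cnt_j = np.asarray(cnt_j, dtype=float)
    return float(cnt_i @ cnt_j /
                 (np.linalg.norm(cnt_i) * np.linalg.norm(cnt_j)))

# ||DocCnt_.,i||^2 = 10^2 + 20^2 + 30^2 as in the worked example above.
print(second_weight([10, 20, 30], [15, 18, 35]))
```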
The following takes generating the external link features of a content publishing point based on a network embedding algorithm as an example.
The network embedding algorithm of the embodiments of this application may be the deep walk (DeepWalk) algorithm, the node-to-vector (node2vec) algorithm, or the like. Taking DeepWalk as an example: in the social network setting, network embedding represents the points of a network with low-dimensional vectors, and these vectors are meant to reflect certain characteristics of the original network; for example, if two points have similar structure in the original network, the vectors representing them should also be similar. In this application, each node of the generated adjacency graph between publishing points and external links can be regarded as a word; random walk sequences are then generated according to the graph structure (the structure of the adjacency graph), and the word-to-vector (Word2Vec) algorithm is used to learn the feature vectors of the nodes in the sequences.
Fig. 5 is a schematic diagram of the principle of DeepWalk provided in an embodiment of this application. The input is the structure of the adjacency graph of the embodiments of this application, viewed as a network: the nodes V of the adjacency graph include the publishing points and the external links, and the edges E include the edges between publishing points and external links and the edges between external links. After random walk learning, the feature vectors of the nodes are output.
In one possible implementation, as shown in Fig. 6, a schematic diagram of the principle of random walk provided in an embodiment of this application, a random walk repeatedly and randomly selects walk paths on the network, finally forming a path through the network. Starting from a particular node, each step of the walk randomly selects one of the edges connected to the current node, moves along the selected edge to the next vertex, and repeats the process. The dashed arrows shown in Fig. 6 represent a random walk: a random walk path rooted at node (i.e., vertex) v_i is denoted

$$W_{v_i},$$

and the nodes passed along the path can be denoted

$$W^{1}_{v_i}, W^{2}_{v_i}, \ldots, W^{k}_{v_i}.$$

A truncated random walk is in fact a random walk of fixed length t.
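A minimal sketch of one truncated random walk of length t, assuming the graph is stored as weighted neighbor lists (the weights being the first and second weights of step S202); the graph contents are illustrative:

```python
import random

# node -> list of (neighbor, edge_weight); toy values for illustration
graph = {
    "account_A": [("url_1", 3.0), ("url_2", 1.0)],
    "url_1": [("account_A", 3.0), ("url_2", 0.8)],
    "url_2": [("account_A", 1.0), ("url_1", 0.8)],
}

def random_walk(graph, root, t):
    # Truncated random walk of fixed length t rooted at `root`; each step
    # picks an incident edge with probability proportional to its weight.
    walk = [root]
    current = root
    for _ in range(t - 1):
        neighbors = graph.get(current)
        if not neighbors:
            break
        nodes, weights = zip(*neighbors)
        current = random.choices(nodes, weights=weights, k=1)[0]
        walk.append(current)
    return walk

print(random_walk(graph, "account_A", t=5))
```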
In one possible implementation, the deep walk algorithm adopted in the embodiments of this application is DeepWalk(G, w, d, γ, t), where G denotes the adjacency graph, w denotes the window size, d denotes the dimension, γ denotes the number of random walks per vertex, and t denotes the length of each random walk.
The input is G(V, E), where V denotes the nodes of the adjacency graph and E denotes the edges of the adjacency graph.
The output is a matrix of vertex representations

$$\Phi \in \mathbb{R}^{|V| \times d},$$

i.e., each vertex has a d-dimensional continuous vector.
The deep walk algorithm may include the following process:
1. Initialize the vector space Φ of each vertex.
2. Build a Huffman tree (constructed according to the number of times each vertex appears in the random walks).
3. Enter a loop from 0 to γ; equivalently, perform γ random walks for each node:
(a) shuffle V to obtain O = Shuffle(V), which is equivalent to randomly reordering the nodes of the network;
(b) for each vertex v_i in O, enter the inner loop: obtain from v_i a random walk sequence of length t rooted at that node, i.e., W_{v_i} = RandomWalk(G, v_i, t), and update the parameters through the SkipGram model, i.e., SkipGram(Φ, W_{v_i}, w);
(c) exit the inner loop.
4. Exit the outer loop.
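A Python sketch of this outer procedure, under the same illustrative assumptions as the walk sketch above (the walk helper is repeated so the block runs standalone). As a simplification, the sketch collects the walks into a corpus rather than applying the SkipGram update per walk as the procedure above does:

```python
import random

graph = {
    "account_A": [("url_1", 3.0), ("url_2", 1.0)],
    "url_1": [("account_A", 3.0), ("url_2", 0.8)],
    "url_2": [("account_A", 1.0), ("url_1", 0.8)],
}

def random_walk(graph, root, t):
    # Truncated, weight-proportional walk of length t (as sketched earlier).
    walk, current = [root], root
    for _ in range(t - 1):
        neighbors = graph.get(current)
        if not neighbors:
            break
        nodes, weights = zip(*neighbors)
        current = random.choices(nodes, weights=weights, k=1)[0]
        walk.append(current)
    return walk

def deepwalk_corpus(graph, gamma, t):
    # DeepWalk outer loop: gamma passes; each pass shuffles the vertices
    # and generates one truncated walk per vertex. The walks would then be
    # fed to SkipGram (see the sketch after the SkipGram description below).
    walks = []
    for _ in range(gamma):
        order = list(graph)
        random.shuffle(order)                       # O = Shuffle(V)
        for v in order:
            walks.append(random_walk(graph, v, t))  # W_v = RandomWalk(G, v, t)
    return walks

print(deepwalk_corpus(graph, gamma=2, t=4))
```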
In one possible implementation, SkipGram(Φ, W_{v_i}, w) may be adopted, where Φ denotes the current vertex vectors, W_{v_i} denotes the sequence generated by the random walk, and w denotes the window size. It specifically includes:
traverse each node v_j in the sequence W_{v_i} and enter the loop;
traverse each vertex u_k within the window of w vertices before and after v_j and enter the inner loop;
update the parameters, e.g., with J(Φ) = -log Pr(u_k | Φ(v_j)), update Φ = Φ - α · ∂J/∂Φ;
exit the inner loop;
exit the outer loop.
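A numpy sketch of one SkipGram parameter update for a pair (v_j, u_k); the full softmax below stands in for the Huffman-tree hierarchical softmax that would be used at scale, and all sizes and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 6, 8                       # toy graph size and dimension
phi = rng.normal(0, 0.1, (num_nodes, d))  # Phi: vertex vectors being learned
psi = rng.normal(0, 0.1, (num_nodes, d))  # output-side vectors for the softmax
alpha = 0.025                             # learning rate

def skipgram_step(vj, uk):
    # One update minimizing J(Phi) = -log Pr(u_k | Phi(v_j)).
    h = phi[vj].copy()
    scores = psi @ h
    p = np.exp(scores - scores.max())
    p /= p.sum()                          # softmax Pr(. | Phi(v_j))
    grad_scores = p
    grad_scores[uk] -= 1.0                # dJ / dscores
    phi[vj] -= alpha * (psi.T @ grad_scores)   # Phi <- Phi - alpha dJ/dPhi
    psi -= alpha * np.outer(grad_scores, h)

skipgram_step(vj=0, uk=2)                 # e.g., v_j = node 0, u_k = node 2
```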
Finally, each official account can be represented by a set of external link vectors. During training, an external link cheating identification model based on these external link feature vectors can therefore be trained and built from positively and negatively labeled samples of official account articles, the negative samples containing cheating external links. During identification, the external link vectors to be identified are input into the external link cheating identification model, which effectively identifies whether the external links are cheating links, achieving effective governance of the link cheating forms present in official accounts.
In one possible implementation, the external link cheating identification model built in the embodiments of this application may also be an identification model combining the external link feature vectors with image-text cheating features. That is, the external link cheating identification model may be a binary classification model that determines cheating by combining text cheating features with link cheating features, which further improves the accuracy of external link identification.
In one possible implementation manner, the step S202 may specifically include:
based on the number of publication units containing outer chains and at least one factor of: an authority value of the content publishing point; the release time of the content data; or the frequency with which the content is distributed by the content distribution point; to calculate a first weight of an edge between the publishing point and an external chain published by itself, and to calculate a second weight of an edge between the first external chain and a second external chain
Specifically, in the process of calculating the first weight and/or the second weight, in addition to the number of publishing units containing an external chain, the embodiments of the present application may also consider the authority value of the content publishing point, the release time of the content data, or the frequency with which the content publishing point publishes content. That is, at least one of these factors is added to the algorithm as a parameter, dimension value, weighting value, or the like. This increases the comprehensiveness of external-chain identification and further improves its accuracy.
The authority value of the content publishing point may be determined according to users' evaluation information about the content publishing point, or according to the management evaluation information about the content publishing point on the server side.
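Putting the graph construction and the weighting together, a sketch under stated assumptions: the posts and authority structures are hypothetical, and folding the authority value in multiplicatively is only one possible way to "add the factor as a weighting value", not a form prescribed by the embodiment. The second weight is left for the co-occurrence formula sketched further below.

```python
import itertools
import networkx as nx

def build_adjacency_graph(posts, authority=None):
    """Sketch of the publishing-point / outer-chain adjacency graph.

    posts: dict mapping each publishing point to a list of publishing
    units, each unit given as the set of outer-chain URLs it contains.
    authority: optional dict of per-point authority values."""
    g = nx.Graph()
    for point, units in posts.items():
        urls = set().union(*units)
        for url in urls:
            # first weight: number of this point's units containing the URL
            w1 = sum(url in unit for unit in units)
            if authority is not None:
                w1 *= authority.get(point, 1.0)
            g.add_edge(point, url, weight=w1)
        # outer-chain/outer-chain edge when the two links appear in
        # *different* publishing units of the same point
        for u1, u2 in itertools.combinations(urls, 2):
            if any(u1 in a and u2 in b
                   for a, b in itertools.permutations(units, 2)):
                g.add_edge(u1, u2)
    return g

# hypothetical data: two points, each with two publishing units
graph = build_adjacency_graph(
    {"point_A": [{"url_1"}, {"url_2", "url_3"}],
     "point_B": [{"url_1", "url_2"}, {"url_1"}]},
    authority={"point_A": 1.2, "point_B": 0.8})
```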
In order to better implement the above solution of the embodiments of the present application, the present application further provides an external chain identification apparatus. As shown in fig. 7, the external chain identification apparatus 70 may include: an adjacency graph generating unit 700, a calculating unit 702, a feature generating unit 704 and a recognition unit 706, wherein:
the adjacency graph generating unit 700 is configured to generate an adjacency graph between publishing points and outer chains according to the content data containing outer chains published by the content publishing point in a first time period; in the adjacency graph, a publishing point is connected by an edge to each outer chain it publishes, and a first outer chain and a second outer chain are connected by an edge, the first outer chain and the second outer chain being two different outer chains respectively published in different publishing units of the same publishing point;
the calculating unit 702 is configured to calculate, according to the number of publishing units containing the outer chain, a first weight of the edge between the publishing point and the outer chain it publishes, and a second weight of the edge between the first outer chain and the second outer chain;
the feature generation unit 704 is configured to generate an out-link feature corresponding to the content publishing point based on the adjacency graph, the first weight, and the second weight;
the recognition unit 706 is configured to input the outer chain features into a trained outer chain cheating recognition model to obtain a recognition result; the identification result represents whether the content publishing point publishes the cheating outer chain.
In one possible implementation manner, the first weight of the edge between the publishing point and an outer chain it publishes is the number of publishing units in the publishing point that contain the outer chain.
In one possible implementation manner, the calculating unit 702 may specifically be configured to: calculate the second weight of the edge between the first outer chain and the second outer chain according to the number of publishing units, published by each publishing point in a publishing point set, that contain the first outer chain and the second outer chain respectively; the publishing point set is the set of publishing points that have published content data containing both the first outer chain and the second outer chain.
In one possible implementation, the second weight of the edge between the first outer chain and the second outer chain is calculated by the following formula:

w2(Url_i, Url_j) = (1/N) · Σ_{u ∈ ComUin(i,j)} (DocCnt_{u,i} / DocCnt_{·,i}) · (DocCnt_{u,j} / DocCnt_{·,j})

wherein Url_i is the first outer chain; Url_j is the second outer chain; ComUin(i, j) is the publishing point set; DocCnt_{u,i} is the number of publishing units published by publishing point u that contain Url_i; DocCnt_{u,j} is the number of publishing units published by publishing point u that contain Url_j; DocCnt_{·,i} is the number of publishing units, across all publishing points in the publishing point set, that contain Url_i; DocCnt_{·,j} is the number of publishing units, across all publishing points in the publishing point set, that contain Url_j; and N is the number of publishing points in the publishing point set.
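A sketch computing this second weight follows. The doc_cnt layout is a hypothetical data structure, and the function implements the normalized co-occurrence form given above.

```python
def second_weight(url_i, url_j, doc_cnt):
    """Second weight of the edge between outer chains url_i and url_j.

    doc_cnt: dict mapping (publishing_point, url) to the number of that
    point's publishing units containing the url."""
    points = {u for (u, _url) in doc_cnt}
    # ComUin(i, j): points whose units contain both outer chains
    com = [u for u in points
           if doc_cnt.get((u, url_i), 0) and doc_cnt.get((u, url_j), 0)]
    if not com:
        return 0.0
    tot_i = sum(doc_cnt.get((u, url_i), 0) for u in com)   # DocCnt.,i
    tot_j = sum(doc_cnt.get((u, url_j), 0) for u in com)   # DocCnt.,j
    n = len(com)                                           # N
    return sum(doc_cnt[(u, url_i)] / tot_i * doc_cnt[(u, url_j)] / tot_j
               for u in com) / n
```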
In one possible implementation manner, the feature generation unit 704 may specifically be configured to: and generating corresponding out-link characteristics of the content publishing point through a network embedding algorithm based on the adjacency graph, the first weight and the second weight.
In one possible implementation, the network embedding algorithm includes the deep walk (DeepWalk) algorithm;
the feature generation unit 704 may specifically be configured to:
generating a random walk sequence based on the adjacency graph, the first weight, and the second weight;
and taking the nodes in the adjacency graph as words, learning the characteristic vectors of the nodes in the random walk sequence by using a word2vec algorithm, and outputting a characteristic vector matrix.
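A minimal sketch of this step with gensim's Word2Vec in skip-gram mode; the walks shown are dummies standing in for the random-walk sequences generated from the weighted adjacency graph, and the hyperparameter values are illustrative.

```python
from gensim.models import Word2Vec

# Dummy walks: node IDs serialized as strings, as word2vec treats them
# as words in a corpus.
walks = [["point_A", "url_1", "point_B", "url_2"],
         ["url_1", "point_A", "url_3"],
         ["point_B", "url_2", "url_3"]]

model = Word2Vec(walks, vector_size=64, window=5, sg=1,  # sg=1: skip-gram
                 min_count=1, epochs=10)

feature_matrix = model.wv.vectors     # the |V| x d feature vector matrix
url_1_vec = model.wv["url_1"]         # d-dimensional vector of one node
```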
In one possible implementation manner, the calculating unit 702 may specifically be configured to: calculate, based on the number of publishing units containing outer chains and at least one of the following factors: the authority value of the content publishing point, the release time of the content data, or the frequency with which the content publishing point publishes content, the first weight of the edge between the publishing point and the outer chain it publishes, and the second weight of the edge between the first outer chain and the second outer chain.
Each unit of the external link identification apparatus 70 in this embodiment of the application is configured to correspondingly execute the steps executed by the execution device in the external link identification method of the embodiments of fig. 1 to fig. 6; details are not repeated here.
Fig. 8 is a schematic structural diagram of an external chain identification device according to an embodiment of the present application. The outer chain identification device 800 shown in fig. 8 (which may be specifically a computer device) comprises a memory 801, a processor 802, a communication interface 803 and a bus 804. The memory 801, the processor 802, and the communication interface 803 are communicatively connected to each other via a bus 804.
The Memory 801 may be a Read Only Memory (ROM), a static Memory device, a dynamic Memory device, or a Random Access Memory (RAM). The memory 801 may store programs, and when the programs stored in the memory 801 are executed by the processor 802, the processor 802 and the communication interface 803 are used for performing the steps of the out-link identification method of the embodiments of the present application.
The processor 802 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to perform the out-link identification method according to the embodiments of the present application.
The processor 802 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the external link identification method of the present application may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 802. The processor 802 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 801, and the processor 802 reads the information in the memory 801 and, in combination with its hardware, completes the external link identification method of the method embodiments of the present application.
In one possible implementation, the external-link identification device 800 may not include the memory 801, and the processor 802 may obtain a cloud-stored program through the communication interface 803 to execute the steps of the external-link identification method according to the application method embodiment.
The communication interface 803 enables communication between the apparatus 800 and other devices or communication networks using transceiver means such as, but not limited to, transceivers. For example, training data may be acquired through the communication interface 803.
Bus 804 may include a pathway to transfer information between various components of device 800, such as memory 801, processor 802, and communication interface 803.
For specific implementation of each functional device, reference may be made to related descriptions in the above method embodiments, and details are not described in this application embodiment.
In a specific implementation, the external link identification device may be a terminal or a server; its specific form may include various devices that can be used by a user, such as a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like, which is not limited in the embodiments of the present application.
It should be understood that the application scenario to which the method provided in the embodiment of the present application may be applied is only an example, and is not limited to this in practical application.
It should also be understood that the reference to first, second, third and various numerical designations in this application are merely for convenience of description and do not limit the scope of this application.
It should be understood that the term "and/or" in this application merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in this application generally indicates that the former and latter associated objects are in an "or" relationship.
In addition, in each embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules and units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, and may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units related to the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in a form of hardware, or may be implemented in a form of software functional unit, which is not limited in this application.
Embodiments of the present application also provide a computer storage medium having instructions stored therein which, when executed on a computer or processor, cause the computer or processor to perform one or more steps of the method of any of the above embodiments. If the constituent modules of the above apparatus are implemented in the form of software functional units and sold or used as independent products, they may be stored in the computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, and the computer product is stored in a computer-readable storage medium.
The computer readable storage medium may be an internal storage unit of the device according to the foregoing embodiment, such as a hard disk or a memory. The computer readable storage medium may be an external storage device of the above-described apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the apparatus. The above-described computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the above embodiments of the methods when the computer program is executed. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device can be merged, divided and deleted according to actual needs.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. An external chain identification method, comprising:
generating an adjacency graph between publishing points and outer chains according to content data containing outer chains published by a content publishing point in a first time period; in the adjacency graph, a publishing point is connected by an edge to each outer chain it publishes, and a first outer chain and a second outer chain are connected by an edge, the first outer chain and the second outer chain being two different outer chains respectively published in different publishing units of the same publishing point;
according to the number of the publishing units containing the outer chains, calculating a first weight of an edge between the publishing point and the outer chain published by the publishing point, and calculating a second weight of an edge between the first outer chain and the second outer chain;
generating corresponding outer chain characteristics of the content publishing point based on the adjacency graph, the first weight and the second weight;
inputting the outer chain characteristics into a trained outer chain cheating recognition model to obtain a recognition result; the identification result represents whether the content publishing point publishes the cheating outer chain.
2. The method of claim 1, wherein the first weight of the edge between the publishing point and the outer chain published by itself is the number of publishing units in the publishing point that contain the outer chain.
3. The method of claim 1, wherein calculating the second weight of the edge between the first outer chain and the second outer chain according to the number of issue units containing the outer chain comprises:
calculating the second weight of the edge between the first outer chain and the second outer chain according to the number of publishing units, published by each publishing point in a publishing point set, that contain the first outer chain and the second outer chain respectively; the publishing point set is the set of publishing points that have published content data containing both the first outer chain and the second outer chain.
4. The method of claim 3, wherein the second weight of the edge between the first outer chain and the second outer chain is calculated by the following formula:

w2(Url_i, Url_j) = (1/N) · Σ_{u ∈ ComUin(i,j)} (DocCnt_{u,i} / DocCnt_{·,i}) · (DocCnt_{u,j} / DocCnt_{·,j})

wherein Url_i is the first outer chain; Url_j is the second outer chain; ComUin(i, j) is the publishing point set; DocCnt_{u,i} is the number of publishing units published by publishing point u that contain Url_i; DocCnt_{u,j} is the number of publishing units published by publishing point u that contain Url_j; DocCnt_{·,i} is the number of publishing units, across all publishing points in the publishing point set, that contain Url_i; DocCnt_{·,j} is the number of publishing units, across all publishing points in the publishing point set, that contain Url_j; and N is the number of publishing points in the publishing point set.
5. The method of claim 1, wherein the generating the content distribution point corresponding out-link feature based on the adjacency graph, the first weight, and the second weight comprises:
and generating corresponding out-link characteristics of the content publishing point through a network embedding algorithm based on the adjacency graph, the first weight and the second weight.
6. The method of claim 5, wherein the network embedding algorithm comprises the deep walk (DeepWalk) algorithm;
generating the corresponding out-link characteristics of the content publishing point through a network embedding algorithm based on the adjacency graph, the first weight and the second weight, wherein the generating comprises:
generating a random walk sequence based on the adjacency graph, the first weight, and the second weight;
and taking the nodes in the adjacency graph as words, learning the characteristic vectors of the nodes in the random walk sequence by using a word2vec algorithm, and outputting a characteristic vector matrix.
7. The method according to claim 1, wherein the calculating a first weight of an edge between the publishing point and the external link published by itself and a second weight of an edge between the first external link and the second external link according to the number of publishing units having external links comprises:
based on the number of publication units containing outer chains and at least one factor of: an authority value of the content publishing point; the release time of the content data; or the frequency with which the content is distributed by the content distribution point;
and calculating a first weight of an edge between the publishing point and an external chain published by the publishing point, and calculating a second weight of an edge between the first external chain and the second external chain.
8. An outer chain identification device, comprising:
the adjacency graph generating unit is used for generating an adjacency graph between publishing points and outer chains according to content data containing outer chains published by a content publishing point in a first time period; in the adjacency graph, a publishing point is connected by an edge to each outer chain it publishes, and a first outer chain and a second outer chain are connected by an edge, the first outer chain and the second outer chain being two different outer chains respectively published in different publishing units of the same publishing point;
the calculating unit is used for calculating, according to the number of publishing units containing the outer chain, a first weight of the edge between the publishing point and the outer chain it publishes, and a second weight of the edge between the first outer chain and the second outer chain;
a feature generation unit, configured to generate an out-link feature corresponding to the content publishing point based on the adjacency graph, the first weight, and the second weight;
the recognition unit is used for inputting the outer chain characteristics into a trained outer chain cheating recognition model to obtain a recognition result; the identification result represents whether the content publishing point publishes the cheating outer chain.
9. An out-link identification device comprising a processor configured to invoke stored program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-7.
CN202010511107.8A 2020-06-05 2020-06-05 Method, device, equipment and computer readable storage medium for identifying outer chain Active CN111680252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010511107.8A CN111680252B (en) 2020-06-05 2020-06-05 Method, device, equipment and computer readable storage medium for identifying outer chain


Publications (2)

Publication Number Publication Date
CN111680252A 2020-09-18
CN111680252B CN111680252B (en) 2023-07-25

Family

ID=72435527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010511107.8A Active CN111680252B (en) 2020-06-05 2020-06-05 Method, device, equipment and computer readable storage medium for identifying outer chain

Country Status (1)

Country Link
CN (1) CN111680252B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102521331A (en) * 2011-12-06 2012-06-27 中国科学院计算机网络信息中心 Webpage redirection cheating detection method and device
CN102567417A (en) * 2010-12-31 2012-07-11 百度在线网络技术(北京)有限公司 Analysis equipment and method for determining reliability of anchor text of hyperlink
CN103092975A (en) * 2013-01-25 2013-05-08 武汉大学 Detection and filter method of network community garbage information based on topic consensus coverage rate
CN107729386A (en) * 2017-09-19 2018-02-23 杭州安恒信息技术有限公司 A kind of dark chain detection technique based on degree of polymerization analysis
CN108363711A (en) * 2017-07-04 2018-08-03 北京安天网络安全技术有限公司 The detection method and device of a kind of dark chain in webpage
CN109522494A (en) * 2018-11-08 2019-03-26 杭州安恒信息技术股份有限公司 A kind of dark chain detection method, device, equipment and computer readable storage medium
CN109784038A (en) * 2018-12-29 2019-05-21 北京奇安信科技有限公司 Detecting black chain method, apparatus, system and computer readable storage medium
CN110765374A (en) * 2019-09-16 2020-02-07 阿里巴巴集团控股有限公司 Risk link identification method and device and computer equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD HAMED FIROOZ et al.: "Link Delay Estimation via Expander Graphs", IEEE Transactions on Communications, vol. 62, no. 1, pages 170-181, XP011537766, DOI: 10.1109/TCOMM.2013.112413.120750 *
ZHANG Ziran et al.: "A linked data mashup *** model and its application for Chinese user-generated content" (面向中文用户生成内容的关联数据混搭***模型及应用), 图书馆学研究 (Library Science Research), no. 8, pages 51-58 *

Also Published As

Publication number Publication date
CN111680252B (en) 2023-07-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant