CN112417873B - Automatic cartoon generation method and system based on BBWC model and MCMC

Automatic cartoon generation method and system based on BBWC model and MCMC

Info

Publication number
CN112417873B
Authority
CN
China
Prior art keywords
crf
information
image
foreground
scene
Prior art date
Legal status
Active
Application number
CN202011221684.XA
Other languages
Chinese (zh)
Other versions
CN112417873A (en)
Inventor
李治江
应德浩
李宇涛
蔡文晖
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202011221684.XA priority Critical patent/CN112417873B/en
Publication of CN112417873A publication Critical patent/CN112417873A/en
Application granted granted Critical
Publication of CN112417873B publication Critical patent/CN112417873B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/002D [Two Dimensional] image generation
    • G06T11/60Editing figures and text; Combining figures or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an automatic cartoon generation method and system based on a BBWC model and MCMC. First, a Chinese data set is annotated with an expanded range of entity labels. A BERT-BiLSTM+WS-CRF named entity recognition model is then designed and trained on the annotated data set to recognize seven kinds of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain information such as the foreground object types, the background template, and their quantity and positional relations. Different scene templates are defined to describe different scenes and supplement the scene information, and a suitable template is selected according to the previously extracted information. The scene layout is then controlled by an MCMC method to generate complete scene information, and a Poisson fusion algorithm is used to achieve seamless fusion of multiple image materials. Finally, the text is input into the final model and a cartoon conforming to the semantics is generated automatically. The invention can reasonably control the size, proportion and positional relation of each image, and can seamlessly fuse multiple image materials.

Description

Automatic cartoon generation method and system based on BBWC model and MCMC
Technical Field
The invention belongs to the field of computer and information service technology, and in particular relates to a method and a system for automatically generating cartoons that conform to the semantics of recognized Chinese natural language text.
Background
In modern life, people encounter vast amounts of image information every day through many channels. Even so, when searching for images, especially with a long query phrase, it is difficult to obtain images from mainstream search engines that fully match the semantics of the query. If the input sentence could be understood and an image matching its semantics generated directly, more satisfactory results would be obtained. Cartoons are a popular image category; if technologies such as natural language processing and image synthesis are used to generate cartoons automatically and applied to fields such as preschool education, comic printing and publishing, and animation, the production cost of cartoons in these fields can be greatly reduced, personalized demands can be met, and a low-threshold, efficient and innovative cartoon production mode can be realized.
Achieving this goal involves four main tasks: natural language processing, image searching, scene arrangement and image synthesis.
In the natural language processing stage, the key scene information in the text, namely the foreground images, the background image, and their quantity and positional relations, needs to be extracted. This is mainly accomplished through a named entity recognition task, whereas a general named entity recognition task mainly recognizes the person names, place names and organization names in a sentence.
At present, mainstream named entity recognition models are based on deep learning. These models have developed rapidly, from CNN- and CRF-based models to LSTM- and CRF-based models, BiLSTM- and CRF-based models, and further models that exploit additional information in the corpus, and they have achieved good results on English and other foreign-language named entity recognition tasks; however, similar models consistently perform worse on Chinese corpora than on English. The main reason for this difference is that most deep-learning-based named entity recognition methods pre-train the text with the word2vec tool and feed the resulting word vectors into the neural network to reduce the cost of manual feature extraction. Chinese vocabulary, however, is rich in polysemy, and the meaning of a word is strongly influenced by its context. With word2vec, the vector obtained for a given word is identical regardless of context, which makes it difficult to recognize polysemous words. Moreover, the affix information, capitalization information and inter-word spaces exploited by some models do not exist in Chinese.
Meanwhile, for named entity recognition in the automatic cartoon generation task, the information describing the quantity and positional relations of entities in a sentence is mainly carried by numerals, prepositions and azimuth words, and the common nouns that appear frequently in the text are also an important part of the semantics. Therefore, on the basis of the three conventional entity types of person names, place names and organization names, common nouns, numerals, prepositions and azimuth words are also recognized as special entities. In general, however, many characters in conventional entities are also common in special entities, so the recognized boundaries of conventional and special entities easily overlap, causing entity boundary recognition errors that destroy the integrity of the conventional entities.
In the image searching stage, the input is the information obtained from the text in the natural language processing stage, mainly the background information and the foreground information, and the corresponding foreground images, background images and their image information are selected accordingly. The background information consists only of the background image name, while the foreground information comprises the foreground object names and the number of foreground objects. In addition, since complete scene information must be generated in the scene arrangement stage, image information such as the picture size and the content contained in the image also needs to be extracted.
In the scene arrangement stage, the scene information obtained in the natural language processing stage is still insufficient to generate a scene, so the scene information is supplemented by defining and matching templates. At the same time, before image synthesis, the relation between foreground objects and background regions, the positional relations among objects, and so on must be controlled; that is, the position and size of each image in the new image must be determined.
In the image synthesis stage, it must be ensured that the final result is a seamless image; the main problem to solve is the discontinuous brightness transition between input images caused by illumination differences. A common image fusion method is weighted average fusion, whose idea is to assign different weights according to the positions of the input images and blend them accordingly. This method is simple to implement, but it loses part of the detail of the original images.
Disclosure of Invention
Because Chinese vocabulary is rich in polysemy, the word2vec tool cannot produce different word vectors for the same word in different contexts; extracting common nouns, numerals, prepositions and azimuth words plays an important role in scene arrangement, but after this recognition task is added, entity boundaries are easily misidentified and the integrity of conventional entities is easily destroyed; the scene information obtained in the natural language processing stage is still insufficient for scene generation in the scene arrangement stage; and image synthesis by weighted average fusion loses part of the detail of the original images. To overcome these defects of the prior art, the invention provides an automatic cartoon generation method based on a BBWC (BERT-BiLSTM+WS-CRF) model and MCMC. The method builds a BERT-BiLSTM+WS-CRF named entity recognition model in the natural language processing stage, treats the numerals, prepositions and azimuth words that carry entity quantity and position information as special entities, and recognizes them together with person names, place names, organization names and common nouns in Chinese named entity recognition. The BERT layer computes the vector representation of each word from its context, thereby modeling polysemy; in particular, for words in conventional entities that are easily identified as special entities, it correctly identifies their role in the sentence and avoids destroying the integrity of the conventional entity. The WS-CRF layer uses the word segmentation task as an auxiliary constraint to help improve the accuracy of boundary recognition for conventional and special entities. Finally, the results of the WS-CRF layer and the BiLSTM-CRF layer are weighted in a certain proportion to obtain the tag sequence with the highest probability. In the scene arrangement stage, templates are defined in advance and matched against the extraction results of the previous stage to complete the scene information, and an MCMC method is used to control the layout so that the size, proportion and position of objects in the picture conform to the semantics and to common sense. Finally, in the image synthesis stage, a Poisson image fusion method is used, which fuses objects and background through the gradient field while preserving detail. As a result, the invention can not only make full use of the context information in Chinese text to accurately extract entities and match scene template information, but also optimize the layout of foreground objects in the scene to obtain the position and proportion of each object and generate a seamlessly fused cartoon image.
The technical scheme of the invention is an automatic cartoon generating method based on a BBWC model and MCMC, which comprises the following steps:
step 1, labeling person names, place names, organization names, common nouns, numerals, prepositions and azimuth words in a Chinese data set consisting of fairy tales, fables and novels, and training a named entity recognition model in the natural language processing stage;
step 2, constructing and training an automatic cartoon generation model based on BBWC and MCMC, which comprises the following steps:
step 2.1, designing a BERT-BiLSTM+WS-CRF named entity recognition model and training it on the training set labeled in step 1 to recognize seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain the foreground object types, the background template information, and their quantity and positional relation information;
step 2.2, an image data set is established, and a corresponding foreground image is selected from the image data set according to the foreground object type and the background template information obtained in the step 2.1;
step 2.3, defining different scene templates to describe different scenes, supplementing scene information, specifying background information and default foreground information in the scenes, and providing an initial position range for the subsequent scene arrangement;
Step 2.4, selecting a proper template from the templates defined in the step 2.3 according to the background template information obtained in the step 2.1;
step 2.5, controlling scene layout by an MCMC method, determining the size, proportion and position relation of each image in the foreground image, and generating complete scene information;
step 2.6, using a poisson fusion algorithm to realize seamless fusion of multiple image materials and reduce fusion marks of image synthesis;
and step 3, inputting the text into a trained automatic cartoon generation model based on BBWC and MCMC, and automatically generating a cartoon conforming to the semantics.
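For illustration, the overall flow of steps 2 and 3 can be sketched in code. All function and module names below (recognize_entities, match_template, and so on) are hypothetical placeholders introduced for this sketch; only the order of the stages follows the method described above.

```python
# Hypothetical end-to-end sketch of the generation pipeline of steps 2-3.
# Every function name is a placeholder; only the stage order follows the text.

def generate_cartoon(text, image_db, scene_templates):
    # Step 2.1: BERT-BiLSTM+WS-CRF named entity recognition on the input text
    entities = recognize_entities(text)

    # Step 2.2: look up foreground images for the recognized object types
    foregrounds = select_foreground_images(image_db, entities)

    # Steps 2.3-2.4: pick a predefined scene template matching the background information
    template = match_template(scene_templates, entities)

    # Step 2.5: optimize the layout (size, proportion, position) with MCMC
    layout = optimize_layout_mcmc(foregrounds, template)

    # Step 2.6: Poisson fusion of all materials into one seamless image
    return poisson_compose(template, foregrounds, layout)
```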
Further, the labeling method in step 1 is as follows: "B" represents the start character of an entity, "I" represents a character of the entity other than the start character, and "O" represents a non-entity character. Then, for the specific entity classes, "PER" is used to represent a person name, "LOC" a place name, "ORG" an organization name, "N" a common noun, "M" a numeral, "P" a preposition, and "F" an azimuth word. The labels of the characters in the data set are therefore divided into fifteen categories: "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-N", "I-N", "B-M", "I-M", "B-P", "I-P", "B-F", "I-F" and "O".
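As a small illustration of this scheme, the fifteen tags can be enumerated directly; the annotated example sentence below is invented for demonstration only and is not taken from the patent's data set.

```python
# The fifteen character-level tags of the BIO labeling scheme described above.
ENTITY_TYPES = ["PER", "LOC", "ORG", "N", "M", "P", "F"]
TAGS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
# -> ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', ..., 'B-F', 'I-F']  (15 tags in total)

# Hypothetical annotated example (not from the patent's data set): "two cats under a tree".
chars  = ["两", "只", "猫", "在", "树", "下"]
labels = ["B-M", "I-M", "B-N", "B-P", "B-N", "B-F"]
```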
Further, the BERT-BiLSTM+WS-CRF named entity recognition model described in step 2.1 includes a BERT layer, a BiLSTM-CRF layer and a WS-CRF layer. The BERT model serves as the feature representation layer and yields word vector representations that depend on the corpus context, from which global and local text features are extracted; the result is then input into the parallel WS-CRF layer and BiLSTM-CRF layer to obtain labels, and the tag sequence with the maximum probability is finally obtained by weighting the two in a certain proportion. In the BiLSTM-CRF layer, let the input sequence of the CRF layer be H = (h_1, h_2, ..., h_n), where n is the number of words in the input sentence, and let the corresponding output tag sequence be y = (y_1, y_2, ..., y_n). Let the state transition matrix A between the BiLSTM and the CRF be the parameter of the CRF layer, i.e. the transition score matrix, where A_{i,j} is the score of the transition from tag i to tag j; let the score matrix of the CRF output layer be P, where P_{i,j} is the output score of the i-th word under the j-th tag in the input sequence. The tag sequence score is:

s(H, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

Normalizing over all possible sequences gives the probability distribution of an output sequence y:

p(y|H) = exp(s(H, y)) / Σ_{y'∈Y_H} exp(s(H, y'))

where Y_H is the set of all possible tag sequences for the input sequence of the CRF. The loss function of the BiLSTM-CRF layer is therefore:

L_CRF = -log(p(y|H))

Let the input of the WS-CRF layer be B = (b_1, b_2, ..., b_n), i.e. the output of the BERT layer, and let its output tag sequence be y_ws = (w_1, w_2, ..., w_n); the loss function of the WS-CRF layer is then:

L_WS-CRF = -log(p(y_ws|B))

The loss function of the BBWC model as a whole is set as a combination of L_CRF and L_WS-CRF weighted by a coefficient α: the larger the value of α, the more the model relies on the loss function of the WS-CRF layer, i.e. the greater the influence of the word segmentation task result on the final recognition result. After multiple experiments, the entity recognition effect is best when α is set to 0.3, and this value is used in the loss function of the final BBWC model.
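A minimal sketch of how such a weighted two-branch loss could be combined during training is given below; the additive form is an assumption made for illustration, since the text only states that a larger α gives the WS-CRF loss more influence and that α = 0.3 performed best.

```python
ALPHA = 0.3  # weighting coefficient for the word segmentation (WS-CRF) branch

def bbwc_loss(l_crf: float, l_ws_crf: float) -> float:
    # Assumed additive weighting: the text only states that a larger alpha gives the
    # WS-CRF loss more influence and that alpha = 0.3 gave the best recognition result.
    return l_crf + ALPHA * l_ws_crf
```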
further, the template defined in step 2.3 specifies the background and default foreground objects in the scene, wherein the background of all scenes consists of several of sky, ground, grass, lake, road. In addition, to supplement the scene information, each template simultaneously gives the type and number of foreground objects contained in the scene by default.
Further, the selection of a template in step 2.4 covers two different cases. If the information extracted in step 2.1 directly contains a defined template, that matching template is selected; if the extracted information does not directly contain a defined template, a suitable template is selected according to the processed information, i.e. a template is suitable if the information extracted in step 2.1 is contained in the objects that may appear in that template.
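The two matching cases can be expressed as a short selection routine; the data layout below (a template with a background name and a set of possible foreground objects) is an assumption made for illustration, not the actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SceneTemplate:
    name: str                    # background type name, e.g. "lake" or "road"
    possible_objects: set[str]   # foreground object types that may appear in this scene

def select_template(templates, background_name, foreground_types):
    # Case 1: the extracted background information directly names a defined template.
    for t in templates:
        if t.name == background_name:
            return t
    # Case 2: otherwise, a template is suitable if the extracted foreground object
    # types are contained in the objects that may appear in it.
    for t in templates:
        if set(foreground_types) <= t.possible_objects:
            return t
    return None
```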
Further, the method for controlling the scene layout by the MCMC method in step 2.5 is as follows:
The state transition sequence of the scene layout is constructed to satisfy the Markov property:

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n)

Based on the Metropolis-Hastings algorithm, the position and size data of all foreground objects form the current state X_t:

X_t = {R_1, R_2, ..., R_n}

R_i = (x_i, y_i, w_i, h_i)

where x_i and y_i are the coordinates of the center of each foreground object, and w_i and h_i are its width and height, respectively.

For each time t, P(X) is used to represent the probability of the state X_t, where k_1, k_2 and k_3 are weight coefficients defined manually as needed:

P(X) = k_1 U + k_2 G + k_3 F

Here U describes how much the foreground objects are occluded; it is computed from N, the number of foreground objects, u, the occluded area, s, the object area, and t, a weight set manually according to importance. G controls the positional relation between the foreground objects and the background regions; it is computed from M, the number of background regions, g, the overlap area between a foreground object and a background region, and r, a manually defined parameter representing the degree of association between the foreground object and the background region. F limits the proportion of a single object in the picture; it is computed from r, the proportion value in the current state, and r_n, the manually set appropriate proportion when n foreground objects are present.

The probability P(X) of the state X is calculated, parameters are adjusted by randomly enlarging, shrinking or moving a foreground object to obtain a new state X*, and the transition probability q(X*|X) is calculated, where W and H are the width and height of the generated scene and x_i and y_i are the center coordinates of the i-th foreground object.

The acceptance probability α of the transition from state X to state X* can then be calculated:

α = min(1, (P(X*) q(X|X*)) / (P(X) q(X*|X)))

A real number u is then generated at random in the range (0, 1), and the transition is accepted if u < α. When P(X*) > P(X), the new state is the better result and the state transitions to the new state. After many repetitions the states reach a stationary distribution, and the state at that point is the optimized result.
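The Metropolis-Hastings iteration described above can be sketched as follows. Because the concrete formulas for U, G, F and the proposal density q are not reproduced in the text, the scene score P(X) and the proposal probability are passed in as placeholder callables; the move sizes are illustrative values.

```python
import random

# Sketch of the Metropolis-Hastings layout optimization described above. scene_score
# plays the role of P(X) = k1*U + k2*G + k3*F and proposal_prob the role of q(.|.);
# their concrete formulas are not reproduced in the text, so both are placeholders.
# proposal_prob(a, b) is read as q(a | b).

def optimize_layout(rects, scene_score, proposal_prob, n_iter=5000):
    """rects: list of (x, y, w, h) tuples, one per foreground object."""
    state = list(rects)
    for _ in range(n_iter):
        candidate = list(state)
        i = random.randrange(len(candidate))
        x, y, w, h = candidate[i]
        if random.random() < 0.5:      # randomly move one object
            candidate[i] = (x + random.uniform(-10, 10), y + random.uniform(-10, 10), w, h)
        else:                          # randomly enlarge or shrink it
            s = random.uniform(0.9, 1.1)
            candidate[i] = (x, y, w * s, h * s)

        # Acceptance probability of the transition X -> X*:
        # alpha = min(1, P(X*) q(X|X*) / (P(X) q(X*|X)))
        alpha = min(1.0, (scene_score(candidate) * proposal_prob(state, candidate)) /
                         (scene_score(state) * proposal_prob(candidate, state)))
        if random.random() < alpha:    # accept the transition with probability alpha
            state = candidate
    return state
```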
Further, in step 2.6, for a discrete digital image, gradients are calculated for each color channel of the source image in the horizontal and vertical directions respectively:

G_x(i, j) = I(i+1, j) - I(i, j)

G_y(i, j) = I(i, j+1) - I(i, j)

where G_x and G_y are the gradient values of the source image in the horizontal and vertical directions, and i and j are the abscissa and ordinate of the digital image I. The partial derivatives of the gradient field are then calculated to obtain the divergence of the composite image, and a Poisson equation is established and solved to obtain the fused image.
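In practice, this kind of gradient-domain blending of a foreground cut-out onto a background can be realized with OpenCV's seamless cloning, which solves the same Poisson equation; the file names and the paste position below are placeholders for illustration.

```python
import cv2

# Placeholder file names; cv2.seamlessClone solves the Poisson equation so that the
# pasted foreground blends into the background without visible seams.
background = cv2.imread("background.png")
foreground = cv2.imread("cat.png")
mask = cv2.imread("cat_mask.png", cv2.IMREAD_GRAYSCALE)  # binary mask of the foreground object

center = (200, 300)  # (x, y) paste position chosen by the scene arrangement stage;
                     # it must be far enough from the border for the foreground to fit
result = cv2.seamlessClone(foreground, background, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("scene.png", result)
```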
The invention also provides an automatic cartoon generating system based on the BBWC model and the MCMC, which comprises the following modules:
the training set labeling module is used for labeling person names, place names, organization names, common nouns, numerals, prepositions and azimuth words in a Chinese data set consisting of fairy tales, fables and novels, so as to train a named entity recognition model in the natural language processing stage;
The model construction and training module is used for constructing and training an automatic cartoon generation model based on BBWC and MCMC, and comprises the following sub-modules:
the natural language processing sub-module is used for designing a BERT-BiLSTM+WS-CRF named entity recognition model and training it on the training set marked by the training set labeling module to identify seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain the foreground object types, the background template information, and their quantity and positional relation information;
the image searching sub-module is used for establishing an image data set by using database management software, and selecting a corresponding foreground image from the image data set according to the foreground object type and the background template information obtained by the natural language processing sub-module;
the template defining sub-module is used for defining different scene templates to describe different scenes, supplementing scene information, specifying background information and default foreground information in the scenes and providing an initial position range for the subsequent scene arrangement;
the template matching submodule is used for selecting a proper template from templates defined by the template definition submodule according to the background template information obtained by the natural language processing module;
The scene arrangement sub-module is used for controlling scene layout through an MCMC method, namely determining the size, proportion and position relation of each image in the foreground image, and generating complete scene information;
the image synthesis submodule is used for realizing seamless fusion of multiple image materials by using a poisson fusion algorithm, and reducing fusion marks of image synthesis;
the automatic cartoon generation module is used for inputting the text into the trained BBWC and MCMC-based automatic cartoon generation model to automatically generate a cartoon conforming to the semantics.
Further, the BERT-BiLSTM+WS-CRF named entity recognition model in the natural language processing sub-module comprises a BERT layer, a BiLSTM-CRF layer and a WS-CRF layer. The BERT model serves as the feature representation layer and yields word vector representations that depend on the corpus context, from which global and local text features are extracted; the result is then input into the parallel WS-CRF layer and BiLSTM-CRF layer to obtain labels, and the tag sequence with the maximum probability is finally obtained by weighting the two in a certain proportion. In the BiLSTM-CRF layer, let the input sequence of the CRF layer be H = (h_1, h_2, ..., h_n), where n is the number of words in the input sentence, and let the corresponding output tag sequence be y = (y_1, y_2, ..., y_n). Let the state transition matrix A between the BiLSTM and the CRF be the parameter of the CRF layer, i.e. the transition score matrix, where A_{i,j} is the score of the transition from tag i to tag j; let the score matrix of the CRF output layer be P, where P_{i,j} is the output score of the i-th word under the j-th tag in the input sequence. The tag sequence score is:

s(H, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

Normalizing over all possible sequences gives the probability distribution of an output sequence y:

p(y|H) = exp(s(H, y)) / Σ_{y'∈Y_H} exp(s(H, y'))

where Y_H is the set of all possible tag sequences for the input sequence of the CRF. The loss function of the BiLSTM-CRF layer is therefore:

L_CRF = -log(p(y|H))

Let the input of the WS-CRF layer be B = (b_1, b_2, ..., b_n), i.e. the output of the BERT layer, and let its output tag sequence be y_ws = (w_1, w_2, ..., w_n); the loss function of the WS-CRF layer is then:

L_WS-CRF = -log(p(y_ws|B))

The loss function of the BBWC model as a whole is set as a combination of L_CRF and L_WS-CRF weighted by a coefficient α: the larger the value of α, the more the model relies on the loss function of the WS-CRF layer, i.e. the greater the influence of the word segmentation task result on the final recognition result. After multiple experiments, the entity recognition effect is best when α is set to 0.3, and this value is used in the loss function of the final BBWC model.
further, the templates described in the template definition sub-module specify background and default foreground objects in the scene, wherein the background of all scenes is composed of several of sky, ground, grass, lake, road. In addition, to supplement the scene information, each template simultaneously gives the type and number of foreground objects contained in the scene by default.
Further, the selection of a template in the template matching sub-module covers two different cases. If the information extracted by the natural language processing sub-module directly contains a defined template, that matching template is selected; if the extracted information does not directly contain a defined template, a suitable template is selected according to the processed information, i.e. a template is suitable if the extracted information is contained in the objects that may appear in that template.
Further, the method for controlling the scene layout by the MCMC method in the scene layout sub-module is as follows:
The state transition sequence of the scene layout is constructed to satisfy the Markov property:

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n)

Based on the Metropolis-Hastings algorithm, the position and size data of all foreground objects form the current state X_t:

X_t = {R_1, R_2, ..., R_n}

R_i = (x_i, y_i, w_i, h_i)

where x_i and y_i are the coordinates of the center of each foreground object, and w_i and h_i are its width and height, respectively.

For each time t, P(X) is used to represent the probability of the state X_t, where k_1, k_2 and k_3 are weight coefficients defined manually as needed:

P(X) = k_1 U + k_2 G + k_3 F

Here U describes how much the foreground objects are occluded; it is computed from N, the number of foreground objects, u, the occluded area, s, the object area, and t, a weight set manually according to importance. G controls the positional relation between the foreground objects and the background regions; it is computed from M, the number of background regions, g, the overlap area between a foreground object and a background region, and r, a manually defined parameter representing the degree of association between the foreground object and the background region. F limits the proportion of a single object in the picture; it is computed from r, the proportion value in the current state, and r_n, the manually set appropriate proportion when n foreground objects are present.

The probability P(X) of the state X is calculated, parameters are adjusted by randomly enlarging, shrinking or moving a foreground object to obtain a new state X*, and the transition probability q(X*|X) is calculated, where W and H are the width and height of the generated scene and x_i and y_i are the center coordinates of the i-th foreground object.

The acceptance probability α of the transition from state X to state X* can then be calculated:

α = min(1, (P(X*) q(X|X*)) / (P(X) q(X*|X)))

A real number u is then generated at random in the range (0, 1), and the transition is accepted if u < α. When P(X*) > P(X), the new state is the better result and the state transitions to the new state. After many repetitions the states reach a stationary distribution, and the state at that point is the optimized result.
Further, the Poisson fusion algorithm in the image synthesis sub-module calculates, for a discrete digital image, the gradients of each color channel of the source image in the horizontal and vertical directions respectively:

G_x(i, j) = I(i+1, j) - I(i, j)

G_y(i, j) = I(i, j+1) - I(i, j)

where G_x and G_y are the gradient values of the source image in the horizontal and vertical directions, and i and j are the abscissa and ordinate of the digital image I. The partial derivatives of the gradient field are then calculated to obtain the divergence of the composite image, and a Poisson equation is established and solved to obtain the fused image.
Compared with the prior art, the invention has the advantages and beneficial effects that:
a) The model makes full use of context information, so the ambiguity of Chinese characters is well represented during recognition.
b) The model adds a word segmentation task to constrain the recognition results, which improves the accuracy of entity boundary recognition.
c) The model sufficiently enriches the scene information by using predefined templates.
d) The model can reasonably control the size, proportion and positional relation of each image.
e) The model achieves seamless fusion of multiple image materials.
Drawings
FIG. 1 is a system flow block diagram;
FIG. 2 is a diagram of a model structure of a natural language processing stage;
FIG. 3 is an example effect diagram, showing an input text and the output caricature image generated from it.
Detailed Description
The invention belongs to the computer and information service technology, and in particular relates to a method and a system for automatically generating cartoon conforming to semantics by recognizing Chinese natural language text. The invention provides an automatic cartoon generation method based on a BBWC model and an MCMC, and the method can automatically generate the cartoon conforming to the input text semantics. The system flow structure diagram is shown in figure 1.
The invention can be realized on a computer, training and running the network with the TensorFlow deep learning framework under a Windows operating system. The specific experimental environment is configured as follows:
step 1, labeling names, places, organization names, common nouns, numbers, prepositions and azimuth words in a Chinese data set consisting of fairy tales, moras and novels, and training a named entity recognition model in a natural language processing stage, wherein the specific implementation process of the embodiment is as follows:
the fairy tales are Chinese translation texts of green fairy tales, the alleys are Chinese translation texts of Issuo alleys, and the novels are Chinese novels of trisomy. And labeling the name, place name, organization name, common noun, number word, preposition and azimuth word in the data set. The notation format "B" indicates the start character of an entity, "I" indicates the characters of the entity other than the start character, and "O" indicates the non-entity character. Then, for a specific entity class, "PER" is used to represent a person name, "LOC" is used to represent a place name, "ORG" is used to represent an organization name, "N" is used to represent a common noun, "M" is used to represent a numeral, "P" is used to represent a preposition, and "F" is used to represent an orientation. Therefore, the labels of the characters in the dataset are classified into fifteen categories of "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-N", "I-N", "B-M", "I-M", "B-P", "I-P", "B-F", "I-F", "O".
The number of labeled instances of each entity type is shown in the following table:
Person names: 16211; Place names: 2174; Organization names: 1235; Common nouns: 171774; Numerals: 27121; Prepositions: 43092; Azimuth words: 20997; Total: 282604
Step 2, constructing and training an automatic cartoon generation model based on BBWC and MCMC, which comprises the following steps:
Step 2.1, a BERT-BiLSTM+WS-CRF named entity recognition model is designed and trained on the training set labeled in step 1 to identify seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain information such as the foreground object types, the background type, and their quantity and positional relations; the specific implementation process of this embodiment is as follows:
First, in the BERT layer, the whole sentence is input and processed in parallel, and the attention matrix of the input sentence is computed with the self-attention mechanism of BERT's core Transformer; each row of this matrix is the attention vector of one word of the input sequence. The pre-training of BERT is a multi-task learning model comprising MLM and NSP, so the loss function of BERT consists of two parts, from MLM and NSP respectively. The jointly learned loss function is then:

L(θ, θ_1, θ_2) = L(θ, θ_1) + L(θ, θ_2)
Next, the results of the BERT layer are input into the parallel BiLSTM-CRF layer and WS-CRF layer. In the BiLSTM, the forget gate f_t lets the recurrent network forget unimportant information in the previous memory cell; f_t is determined jointly by the current input x_t, the output h_{t-1} at the previous time and the state C_{t-1} of the memory cell at the previous time. The input gate i_t supplements the newest memory and, from x_t, h_{t-1} and C_{t-1}, decides what information enters the state C_t of the memory cell at the current time. The output gate o_t determines the output h_t at the current time and is determined jointly by C_t, h_{t-1} and x_t. The specific calculation at time t is:

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{Cf} C_{t-1} + b_f)

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{Ci} C_{t-1} + b_i)

C_t = f_t C_{t-1} + i_t tanh(W_{xC} x_t + W_{hC} h_{t-1} + b_C)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{Co} C_t + b_o)

h_t = o_t tanh(C_t)

where each W is a weight matrix connecting two layers of the LSTM, with the subscript indicating which layers are connected; σ and tanh are the two neuron activation functions; and each b is a bias vector, with the subscript indicating the layer to which it belongs. After the LSTM is realized, a reverse LSTM is added to obtain a bidirectional LSTM, i.e. BiLSTM: the forward LSTM captures the features of the history information before time t, represented by the vector h_t^→, and the reverse LSTM captures the features of the future information after time t, represented by the vector h_t^←; the combined feature representation vector is h_t = [h_t^→ ; h_t^←]. Then, in the BiLSTM-CRF layer, let the input sequence of the CRF be H = (h_1, h_2, ..., h_n), where n is the number of words in the input sentence, and let the corresponding output tag sequence be y = (y_1, y_2, ..., y_n). Let the state transition matrix A between the BiLSTM and the CRF be the parameter of the CRF layer, i.e. the transition score matrix, where A_{i,j} is the score of the transition from tag i to tag j; let the score matrix of the CRF output layer be P, where P_{i,j} is the output score of the i-th word under the j-th tag in the input sequence. The tag sequence score is:

s(H, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

Normalizing over all possible sequences gives the probability distribution of an output sequence y:

p(y|H) = exp(s(H, y)) / Σ_{y'∈Y_H} exp(s(H, y'))

where Y_H is the set of all possible tag sequences for the input sequence of the CRF. The loss function of the BiLSTM-CRF layer is therefore:

L_CRF = -log(p(y|H))

Meanwhile, let the input of the WS-CRF layer be B = (b_1, b_2, ..., b_n), i.e. the output of the BERT layer, and let its output tag sequence be y_ws = (w_1, w_2, ..., w_n); the loss function of the WS-CRF layer is then:

L_WS-CRF = -log(p(y_ws|B))

The loss function of the BBWC model as a whole is therefore set as a combination of L_CRF and L_WS-CRF weighted by a coefficient α, where the larger the value of α, the more the model relies on the loss function of the WS-CRF layer, i.e. the greater the influence of the word segmentation task result on the final recognition result. After multiple experiments, the entity recognition effect is best when α is set to 0.3, and this value is used in the loss function of the final BBWC model.
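For illustration, the bidirectional feature extraction over the BERT outputs can be instantiated with TensorFlow/Keras as sketched below; the hidden size of 128 and the BERT output dimension of 768 are illustrative assumptions, the tag count of 15 follows the labeling scheme above, and the CRF decoding layer that would consume these emission scores is omitted.

```python
import tensorflow as tf

# Sketch of a BiLSTM feature extractor over BERT outputs; hidden size 128 and the
# 768-dimensional BERT output are illustrative assumptions, and the CRF layer that
# would follow the Dense emissions is omitted here.
NUM_TAGS = 15

bert_output = tf.keras.Input(shape=(None, 768))      # (batch, seq_len, BERT hidden size)
bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(bert_output)
emissions = tf.keras.layers.Dense(NUM_TAGS)(bilstm)  # per-token scores fed to the CRF layer
model = tf.keras.Model(inputs=bert_output, outputs=emissions)
```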
Step 2.2, an image data set is established based on MySQL to manage the image resources; the data stored in the data set comprise the source images of the foreground images, the binary mask images corresponding to each foreground object used for Poisson image fusion, the source images of the background images defined in the background templates, and the foreground object information contained in each background template. The foreground object type and background type information obtained in step 2.1 are used to select the corresponding foreground images and background template from the data set; the specific implementation process of this embodiment is as follows:
in this stage, the image searching stage inputs information obtained by processing text information in the natural language processing stage, mainly foreground object type and background type, and selects corresponding foreground image and background template from the data set according to the information. The foreground information comprises foreground object names and the number of foreground objects, and the specified number of images are randomly selected from the foreground image set. The background template comprises corresponding background images and foreground object images possibly appearing in the corresponding scene.
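A minimal sketch of this look-up against a MySQL-backed data set is given below; the table and column names are assumptions made for illustration, not the actual schema.

```python
import random
import mysql.connector  # requires the mysql-connector-python package

# Assumed schema: a table foreground_images(object_type, image_path, mask_path).
conn = mysql.connector.connect(host="localhost", user="user", password="pass",
                               database="cartoon_assets")
cursor = conn.cursor()

def pick_foregrounds(object_type: str, count: int):
    """Randomly select `count` images of the requested foreground object type."""
    cursor.execute(
        "SELECT image_path, mask_path FROM foreground_images WHERE object_type = %s",
        (object_type,))
    rows = cursor.fetchall()
    return random.sample(rows, min(count, len(rows)))
```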
Step 2.3, defining different scene templates to describe different scenes, for supplementing scene information, specifying background information and default foreground information in the scene, and providing an initial position range for subsequent scene arrangement, wherein the specific implementation process of the embodiment is as follows:
The defined templates specify the background and default foreground objects in the scene, where the background of all scenes consists of several of sky, ground, grass, lake, road. In addition, to supplement the scene information, each template simultaneously gives the type and number of foreground objects contained in the scene by default.
Step 2.4, selecting a proper template from the templates defined in step 2.3 according to the background template information obtained in step 2.1, wherein the specific implementation process of the embodiment is as follows:
If a defined background template type name in the image data set corresponds directly to the background type information extracted in step 2.1, the template with the same name is selected as the matching template; if the extracted information does not directly contain a defined template, a suitable template is selected according to the processed information, i.e. a template is suitable if the foreground object types extracted in step 2.1 are contained in the objects that may appear in that template.
Step 2.5, controlling the scene layout by an MCMC method according to the information obtained in step 2.2 and step 2.4, that is, determining the size, proportion and position relation of each image, and generating complete scene information, where the specific implementation process of the embodiment is as follows:
The state transition sequence of the scene layout is constructed to satisfy the Markov property:

P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, ..., X_n = x_n) = P(X_{n+1} = x | X_n = x_n)

Based on the Metropolis-Hastings algorithm, the position and size data of all foreground objects form the current state X_t:

X_t = {R_1, R_2, ..., R_n}

R_i = (x_i, y_i, w_i, h_i)

where x_i and y_i are the coordinates of the center of each foreground object, and w_i and h_i are its width and height, respectively.

For each time t, P(X) is used to represent the probability of the state X_t, where k_1, k_2 and k_3 are weight coefficients defined manually as needed:

P(X) = k_1 U + k_2 G + k_3 F

Here U describes how much the foreground objects are occluded; it is computed from N, the number of foreground objects, u, the occluded area, s, the object area, and t, a weight set manually according to importance. G controls the positional relation between the foreground objects and the background regions; it is computed from M, the number of background regions, g, the overlap area between a foreground object and a background region, and r, a manually defined parameter representing the degree of association between the foreground object and the background region. F limits the proportion of a single object in the picture; it is computed from r, the proportion value in the current state, and r_n, the manually set appropriate proportion when n foreground objects are present.

The probability P(X) of the state X is calculated, parameters are adjusted by randomly enlarging, shrinking or moving a foreground object to obtain a new state X*, and the transition probability q(X*|X) is calculated, where W and H are the width and height of the generated scene and x_i and y_i are the center coordinates of the i-th foreground object.

The acceptance probability α of the transition from state X to state X* can then be calculated:

α = min(1, (P(X*) q(X|X*)) / (P(X) q(X*|X)))

A real number u is then generated at random in the range (0, 1), and the transition is accepted if u < α. When P(X*) > P(X), the new state is the better result and the state transitions to the new state. After many repetitions the states reach a stationary distribution, and the state at that point is the optimized result.
Step 2.6, according to the information obtained in step 2.5, using a poisson fusion algorithm to realize seamless fusion of multiple image materials and reduce fusion marks of image synthesis, wherein the specific implementation process of the embodiment is as follows:
For a discrete digital image, gradients are calculated for each color channel of the source image in the horizontal and vertical directions respectively:

G_x(i, j) = I(i+1, j) - I(i, j)

G_y(i, j) = I(i, j+1) - I(i, j)

where G_x and G_y are the gradient values of the source image in the horizontal and vertical directions, and i and j are the abscissa and ordinate of the digital image I. The partial derivatives of the gradient field are then calculated to obtain the divergence of the composite image, and a Poisson equation is established and solved to obtain the fused image.
Step 3, the text is input into the trained automatic cartoon generation model based on BBWC and MCMC, and a cartoon conforming to the semantics is generated automatically.
The embodiment of the invention also provides an automatic cartoon generating system based on the BBWC model and the MCMC, which comprises the following modules:
the training set labeling module is used for labeling person names, place names, organization names, common nouns, numerals, prepositions and azimuth words in a Chinese data set consisting of fairy tales, fables and novels, so as to train a named entity recognition model in the natural language processing stage;
the model construction and training module is used for constructing and training an automatic cartoon generation model based on BBWC and MCMC, and comprises the following sub-modules:
the natural language processing sub-module is used for designing a BERT-BiLSTM+WS-CRF named entity recognition model and training it on the training set marked by the training set labeling module to identify seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain information such as the foreground images, the background images, and their quantity and positional relations;
the image searching sub-module establishes an image data set to manage image resources, and is used for selecting corresponding foreground images and background images, foreground image information and background image information in the data set according to the information obtained by the natural language processing sub-module;
The template defining sub-module is used for defining different scene templates to describe different scenes, supplementing scene information, specifying background information and default foreground information in the scenes and providing an initial position range for the subsequent scene arrangement;
the template matching sub-module is used for selecting a proper template from templates defined by the template definition sub-module according to the information obtained by the natural language processing sub-module;
the scene arrangement sub-module is used for controlling scene layout through an MCMC method, namely determining the size, proportion and position relation of each image and generating complete scene information;
the image synthesis submodule is used for realizing seamless fusion of multiple image materials by using a poisson fusion algorithm, and reducing fusion marks of image synthesis;
and the automatic cartoon generation module is used for inputting the text into an automatic cartoon generation model based on BBWC and MCMC, and automatically generating a cartoon conforming to the semantics.
The specific implementation of each module corresponds to each step, and is not described in this embodiment.
The input text and the automatically generated caricature image of this example are shown in FIG. 3.
The specific embodiments described herein are offered by way of example only. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (7)

1. The automatic cartoon generation method based on the BBWC model and the MCMC is characterized by comprising the following steps of:
step 1, labeling person names, place names, organization names, common nouns, numerals, prepositions and azimuth words in a Chinese data set consisting of fairy tales, fables and novels, and training a named entity recognition model in the natural language processing stage;
step 2, constructing and training an automatic cartoon generation model based on BBWC and MCMC, which comprises the following steps:
step 2.1, designing a BERT-BiLSTM+WS-CRF named entity recognition model and training it on the training set labeled in step 1 to recognize seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and azimuth words, so as to obtain the foreground object types, the background template information, and their quantity and positional relation information;
the BERT-BiLSTM+WS-CRF named entity recognition model in step 2.1 comprises a BERT layer, a BiLSTM-CRF layer and a WS-CRF layer, wherein BERT is used as a feature representation layer to obtain word vector representations related to the corpus context so as to extract global and local text features; the result is then input into the parallel WS-CRF layer and BiLSTM-CRF layer to obtain labels, and finally the tag sequence with the maximum probability is obtained by weighting the two in a certain proportion; in the BiLSTM-CRF layer, the input sequence of the CRF layer is H = (h_1, h_2, ..., h_n), where n is the number of words in the input sentence, and the corresponding output tag sequence is y = (y_1, y_2, ..., y_n); the state transition matrix A between the BiLSTM and the CRF is the parameter of the CRF layer, i.e. the transition score matrix, where A_{i,j} is the score of the transition from tag i to tag j; the score matrix of the CRF output layer is P, where P_{i,j} is the output score of the i-th word under the j-th tag in the input sequence, and the tag sequence score is:

s(H, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}

normalizing over all possible sequences gives the probability distribution of an output sequence y:

p(y|H) = exp(s(H, y)) / Σ_{y'∈Y_H} exp(s(H, y'))

where Y_H is the set of all possible tag sequences for the input sequence of the CRF, so the loss function of the BiLSTM-CRF layer is:

L_CRF = -log(p(y|H))

the input of the WS-CRF layer is B = (b_1, b_2, ..., b_n), i.e. the output of the BERT layer, and its output tag sequence is y_ws = (w_1, w_2, ..., w_n), so the loss function of the WS-CRF layer is:

L_WS-CRF = -log(p(y_ws|B))

the loss function of the BBWC model as a whole is therefore set as a combination of L_CRF and L_WS-CRF weighted by a coefficient α, wherein the larger the value of α, the more the model relies on the loss function of the WS-CRF layer, i.e. the greater the influence of the word segmentation task result on the final recognition result;
step 2.2, an image data set is established, and a corresponding foreground image is selected from the image data set according to the foreground object type and the background template information obtained in the step 2.1;
Step 2.3, defining different scene templates to describe different scenes, supplementing scene information, specifying background information and default foreground information in the scenes, and providing an initial position range for the subsequent scene arrangement;
step 2.4, selecting a proper template from the templates defined in the step 2.3 according to the background template information obtained in the step 2.1;
the selection of templates in step 2.4 includes two different cases: if the defined background template type names in the image dataset have a direct corresponding relation with the background type information extracted in the step 2.1, selecting templates with the same names as matching templates; if the extracted information does not directly contain the defined template, selecting a proper template according to the information obtained by processing, namely, if the foreground object type extracted in the step 2.1 is contained in the objects possibly appearing in one template, the template is a proper template;
step 2.5, controlling scene layout by an MCMC method, determining the size, proportion and position relation of each image in the foreground image, and generating complete scene information;
step 2.6, using a poisson fusion algorithm to realize seamless fusion of multiple image materials and reduce fusion marks of image synthesis;
And step 3, inputting the text into a trained automatic cartoon generation model based on BBWC and MCMC, and automatically generating a cartoon conforming to the semantics.
2. The automatic cartoon generating method based on the BBWC model and MCMC according to claim 1, wherein the labeling method in step 1 is as follows: "B" represents the start character of an entity, "I" represents a character of the entity other than the start character, and "O" represents a non-entity character; then, for the specific entity classes, "PER" is used to represent a person name, "LOC" a place name, "ORG" an organization name, "N" a common noun, "M" a numeral, "P" a preposition, and "F" an azimuth word; the labels of the characters in the data set are therefore divided into fifteen categories: "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-N", "I-N", "B-M", "I-M", "B-P", "I-P", "B-F", "I-F" and "O".
3. The automatic cartoon generating method based on BBWC model and MCMC according to claim 1, wherein: the templates defined in step 2.3 define the background and default foreground objects in the scene, wherein the background of all scenes consists of several of sky, ground, grass, lakes, roads, and in addition, each template gives the type and number of foreground objects contained by default in the scene at the same time for supplementing the scene information.
4. The automatic cartoon generating method based on BBWC model and MCMC according to claim 1, wherein: the method for controlling the scene layout by the MCMC method in the step 2.5 is as follows:
the state transition sequence of the constructed scene layout is as follows:
P(X n+1 =x|X 1 =x 1 ,X 2 =x 2 ,…,X n =x n )=P(X n+1 =x|X n =x n )
based on Metropolis-Hastings algorithm, the position and size data of all foreground objects are formed into the current state X t
X t ={R 1 ,R 2 ,…,R n }
Wherein x and y are coordinates of the center of each foreground object, and w and h are the width and height of each foreground object respectively;
R i =(x i ,y i ,w i ,h i )
for each time t, state X is represented using P (X) t Probability under, where k 1 、k 2 、k 3 The weight coefficients are artificially defined according to the needs:
P(X)=k 1 U+k 2 G+k 3 F
wherein U represents the degree to which foreground objects are covered: N denotes the number of foreground objects, u the covered area, s the object area, and t a weight set manually according to the importance of the object:
G controls the positional relationship between foreground objects and background regions: M denotes the number of background regions, g the overlapping area of a foreground object and a background region, and r a manually defined parameter representing the degree of association between the foreground object and the background region:
F limits the proportion of the picture occupied by a single object: r denotes the proportion value in the current state, and r_n the manually set proportion when n foreground objects are present:
the probability P(X) of the state X is calculated, and a new state X* is obtained by randomly enlarging, shrinking or moving a foreground object to adjust its parameters, and the transition probability q(X*|X) is calculated, where W and H are the width and height of the generated scene, and x_i and y_i are the center coordinates of the i-th foreground object;
the acceptance probability α of the transition from state X to state X* is then calculated;
a real number u is then generated uniformly at random in (0, 1); if u < α, the transition is accepted; when P(X*) > P(X), the new state is a better result and the chain transitions to it, the acceptance probability α being larger in that case, so the chain tends to move toward higher-probability states; after this is repeated many times the chain reaches a stationary distribution, and the state at that point is the optimized result.
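The Metropolis-Hastings loop of this claim can be sketched as follows, assuming a symmetric proposal so that q(X|X*)/q(X*|X) cancels in the acceptance ratio; the U, G and F terms below are simplified stand-ins for the patent's formulas, and all weights and ranges are illustrative.

```python
import math
import random

def overlap(a, b):
    """Intersection area of two boxes given as (cx, cy, w, h)."""
    ix = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    iy = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    return ix * iy

def prob(state, canvas=(800.0, 600.0), k1=1.0, k2=1.0, k3=1.0, target=0.15):
    W, H = canvas
    # U: penalize foreground objects covering each other (placeholder for the patent's U term).
    U = -sum(overlap(a, b) / (a[2] * a[3]) for i, a in enumerate(state) for b in state[i + 1:])
    # G: reward objects staying inside the scene (placeholder for the background-association term).
    G = -sum(max(0.0, abs(x - W / 2) - W / 2) + max(0.0, abs(y - H / 2) - H / 2) for x, y, w, h in state)
    # F: keep each object's share of the picture near a target ratio (placeholder for the F term).
    F = -sum(abs(w * h / (W * H) - target) for x, y, w, h in state)
    return math.exp(k1 * U + k2 * G + k3 * F)  # unnormalized P(X)

def propose(state):
    """Randomly enlarge, shrink or move one foreground object to obtain a candidate state X*."""
    new = list(state)
    i = random.randrange(len(new))
    x, y, w, h = new[i]
    if random.random() < 0.5:
        s = random.uniform(0.9, 1.1)
        new[i] = (x, y, w * s, h * s)
    else:
        new[i] = (x + random.uniform(-20, 20), y + random.uniform(-20, 20), w, h)
    return new

def optimize_layout(state, iterations=5000):
    p = prob(state)
    for _ in range(iterations):
        cand = propose(state)
        p_cand = prob(cand)
        alpha = min(1.0, p_cand / max(p, 1e-12))   # acceptance probability for a symmetric proposal
        if random.random() < alpha:                # accept the transition if u < alpha
            state, p = cand, p_cand
    return state

# e.g. optimize_layout([(100, 100, 120, 200), (140, 120, 80, 80)])
```

With a symmetric proposal the acceptance probability reduces to min(1, P(X*)/P(X)), which is why accepting with probability α drives the chain toward higher-probability layouts and, after many iterations, toward a stationary distribution.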
5. The automatic cartoon generation method based on the BBWC model and MCMC according to claim 1, wherein in step 2.6, for a discrete digital image, the gradient of each color channel of the source image is calculated in the horizontal and vertical directions respectively, these being the horizontal and vertical gradient values of the source image, where i and j are the abscissa and ordinate of the digital image I:
the partial derivatives of the gradient field are then taken to obtain the divergence of the composite image, and a Poisson equation is established and solved to obtain the fused image.
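As an illustration of the gradient and divergence bookkeeping behind this claim, the following sketch computes forward-difference gradients and their divergence with NumPy and then delegates the actual Poisson solve to OpenCV's seamless cloning; the file names, mask and blend center are assumptions for illustration.

```python
import cv2
import numpy as np

def gradients(channel):
    """Forward-difference gradients of one color channel in the horizontal and vertical directions."""
    c = channel.astype(np.float64)
    gx = np.zeros_like(c)
    gy = np.zeros_like(c)
    gx[:, :-1] = c[:, 1:] - c[:, :-1]   # horizontal gradient
    gy[:-1, :] = c[1:, :] - c[:-1, :]   # vertical gradient
    return gx, gy

def divergence(gx, gy):
    """Divergence of the gradient field, i.e. the right-hand side of the Poisson equation."""
    div = np.zeros_like(gx)
    div[:, 1:] += gx[:, 1:] - gx[:, :-1]
    div[1:, :] += gy[1:, :] - gy[:-1, :]
    return div

# In practice the Poisson equation can be solved with OpenCV's seamless cloning,
# assuming the foreground fits inside the background at the given center.
src = cv2.imread("foreground.png")                 # hypothetical foreground material
dst = cv2.imread("background.png")                 # hypothetical background scene
mask = 255 * np.ones(src.shape[:2], np.uint8)      # blend the whole foreground patch
center = (dst.shape[1] // 2, dst.shape[0] // 2)
blended = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
```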
6. An automatic cartoon generation system based on a BBWC model and MCMC, characterized by comprising the following modules:
the training set labeling module is used for labeling person names, place names, organization names, common nouns, numerals, prepositions and direction words in a Chinese dataset consisting of fairy tales, fables and novels, so as to train the named entity recognition model of the natural language processing stage;
the model construction and training module is used for constructing and training the BBWC- and MCMC-based automatic cartoon generation model, and comprises the following sub-modules:
the natural language processing sub-module is used for designing the BERT-BiLSTM+WS-CRF named entity recognition model, training it on the training set labeled by the training set labeling module, and recognizing seven types of entities, namely person names, place names, organization names, common nouns, numerals, prepositions and direction words, so as to obtain the foreground object types, the background template information, and their number and positional relationship information;
the BERT-BiLSTM+WS-CRF named entity recognition model in the natural language processing sub-module comprises a BERT layer, a BiLSTM-CRF layer and a WS-CRF layer; the BERT model serves as the feature representation layer to obtain word vector representations related to the corpus context, so as to extract global and local text features; the result is then fed into the parallel WS-CRF and BiLSTM-CRF layers to obtain labels, and finally the label sequence with the maximum probability is obtained by weighting the two in a certain proportion; in the BiLSTM-CRF layer, let the input sequence of the CRF layer be H = (h_1, h_2, ..., h_n), where n is the number of words in the input sentence, and let the corresponding output label sequence be y = (y_1, y_2, ..., y_n); a state transition matrix A between the BiLSTM and the CRF is a parameter of the CRF layer, i.e. the transition score matrix, where A_{i,j} is the score of transitioning from label i to label j; let the score matrix of the CRF output layer be P, with entries P_{i,j}; the score of the label sequence is:
normalizing over all possible sequences gives the probability distribution of the output sequence y:
where Y_H is the set of all possible label sequences for the input sequence of the CRF; the loss function of the BiLSTM-CRF layer is therefore:
L_CRF = -log(p(y|H))
let the input of the WS-CRF layer be B = (b_1, b_2, ..., b_n), i.e. the output of the BERT layer, and let the output label sequence be y_ws = (w_1, w_2, ..., w_n); the loss function of the WS-CRF layer is therefore:
the overall loss function of the BBWC model is therefore set as:
where α is a weighting coefficient; the larger the value of α, the more the model relies on the loss function of the WS-CRF layer, i.e. the greater the influence of the word segmentation task result on the final recognition result;
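A minimal PyTorch-style sketch of such a weighted combination is given below, assuming the BiLSTM-CRF and WS-CRF branches already return their per-batch losses; the convex-combination form and all names are one plausible reading of the claim, not the patent's exact formula.

```python
import torch
import torch.nn as nn

class BBWCLoss(nn.Module):
    """Weighted combination of the BiLSTM-CRF loss and the WS-CRF (word segmentation) loss."""
    def __init__(self, alpha: float = 0.3):
        super().__init__()
        self.alpha = alpha  # larger alpha -> more weight on the WS-CRF branch

    def forward(self, loss_crf: torch.Tensor, loss_ws: torch.Tensor) -> torch.Tensor:
        # One plausible reading of the claim: a convex combination of the two CRF losses.
        return (1.0 - self.alpha) * loss_crf + self.alpha * loss_ws

# usage (losses produced elsewhere by the BiLSTM-CRF and WS-CRF heads):
criterion = BBWCLoss(alpha=0.3)
total_loss = criterion(torch.tensor(2.1), torch.tensor(1.4))
```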
the image searching sub-module is used for establishing an image dataset and selecting the corresponding foreground images from the image dataset according to the foreground object types and background template information obtained by the natural language processing sub-module;
the template defining sub-module is used for defining different scene templates to describe different scenes, supplementing the scene information, specifying the background information and the default foreground information of a scene, and providing an initial position range for the subsequent scene arrangement;
the template matching sub-module is used for selecting a suitable template from the templates defined by the template defining sub-module according to the background template obtained by the natural language processing sub-module;
the selection of templates covers two cases: if a background template type name defined in the image dataset corresponds directly to the background type information extracted by the natural language processing sub-module, the template with the same name is selected as the matching template; if the extracted information does not directly name a defined template, a suitable template is selected according to the processed information, i.e. if the objects that may appear in a template include the foreground object types extracted by the natural language processing sub-module, that template is a suitable template;
the scene arrangement sub-module is used for controlling the scene layout by the MCMC method, namely determining the size, proportion and positional relationship of each foreground image, and generating complete scene information;
the image synthesis sub-module uses the Poisson fusion algorithm to achieve seamless fusion of multiple image materials and reduce the blending seams of image synthesis;
the automatic cartoon generation module is used for inputting the text into the trained BBWC- and MCMC-based automatic cartoon generation model to automatically generate a cartoon that conforms to the semantics.
7. The automatic cartoon generation system based on the BBWC model and MCMC according to claim 6, wherein the scene arrangement sub-module controls the scene layout by the MCMC method as follows:
the state transition sequence of the constructed scene layout is:
P(X_{n+1} = x | X_1 = x_1, X_2 = x_2, …, X_n = x_n) = P(X_{n+1} = x | X_n = x_n)
based on the Metropolis-Hastings algorithm, the position and size data of all foreground objects form the current state X_t:
X_t = {R_1, R_2, …, R_n}, with R_i = (x_i, y_i, w_i, h_i)
where x_i and y_i are the coordinates of the center of the i-th foreground object, and w_i and h_i are its width and height respectively;
for each time t, P(X) is used to represent the probability of the state X_t, where k_1, k_2 and k_3 are weight coefficients defined manually as needed:
P(X) = k_1·U + k_2·G + k_3·F
wherein U represents the degree to which foreground objects are covered: N denotes the number of foreground objects, u the covered area, s the object area, and t a weight set manually according to the importance of the object:
G controls the positional relationship between foreground objects and background regions: M denotes the number of background regions, g the overlapping area of a foreground object and a background region, and r a manually defined parameter representing the degree of association between the foreground object and the background region:
F limits the proportion of the picture occupied by a single object: r denotes the proportion value in the current state, and r_n the manually set proportion when n foreground objects are present:
the probability P(X) of the state X is calculated, and a new state X* is obtained by randomly enlarging, shrinking or moving a foreground object to adjust its parameters, and the transition probability q(X*|X) is calculated, where W and H are the width and height of the generated scene, and x_i and y_i are the center coordinates of the i-th foreground object;
the acceptance probability α of the transition from state X to state X* is then calculated;
a real number u is then generated uniformly at random in (0, 1); if u < α, the transition is accepted; when P(X*) > P(X), the new state is a better result and the chain transitions to it, the acceptance probability α being larger in that case, so the chain tends to move toward higher-probability states; after this is repeated many times the chain reaches a stationary distribution, and the state at that point is the optimized result.
CN202011221684.XA 2020-11-05 2020-11-05 Automatic cartoon generation method and system based on BBWC model and MCMC Active CN112417873B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011221684.XA CN112417873B (en) 2020-11-05 2020-11-05 Automatic cartoon generation method and system based on BBWC model and MCMC


Publications (2)

Publication Number Publication Date
CN112417873A CN112417873A (en) 2021-02-26
CN112417873B true CN112417873B (en) 2024-02-09

Family

ID=74828750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011221684.XA Active CN112417873B (en) 2020-11-05 2020-11-05 Automatic cartoon generation method and system based on BBWC model and MCMC

Country Status (1)

Country Link
CN (1) CN112417873B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392245B (en) * 2021-06-16 2023-12-26 南京大学 Text abstract and image-text retrieval generation method for public testing task release
CN113744369A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Animation generation method, system, medium and electronic terminal
CN113743520A (en) * 2021-09-09 2021-12-03 广州梦映动漫网络科技有限公司 Cartoon generation method, system, medium and electronic terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014102596A (en) * 2012-11-17 2014-06-05 Supersoftware Co Ltd Image processor
WO2019015269A1 (en) * 2017-07-18 2019-01-24 中译语通科技股份有限公司 Korean named entities recognition method based on maximum entropy model and neural network model
CN109859095A (en) * 2018-12-18 2019-06-07 大连理工大学 A kind of caricature automatic creation system and method
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model; Xie Teng; Yang Jun'an; Liu Hui; Computer Systems & Applications (No. 07); full text *
Emergency Communication Method for Nuclear Radiation Monitoring Based on BeiDou RDSS; Wang Tingyin; Lin Minggui; Chen Da; Wu Yunping; Computer Systems & Applications (No. 12); full text *
Research on a Color Image Binarization Method Based on Naive Bayes Theory; Li Zhijiang; Cong Lin; Digital Printing (No. 01); full text *

Also Published As

Publication number Publication date
CN112417873A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112417873B (en) Automatic cartoon generation method and system based on BBWC model and MCMC
CN108829677B (en) Multi-modal attention-based automatic image title generation method
CN110334705B (en) Language identification method of scene text image combining global and local information
CN112613273B (en) Compression method and system of multi-language BERT sequence labeling model
CN111444721A (en) Chinese text key information extraction method based on pre-training language model
CN106250915A (en) A kind of automatic image marking method merging depth characteristic and semantic neighborhood
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN109800298A (en) A kind of training method of Chinese word segmentation model neural network based
CN108090400A (en) A kind of method and apparatus of image text identification
CN107766320A (en) A kind of Chinese pronoun resolution method for establishing model and device
CN112784604A (en) Entity linking method based on entity boundary network
CN110909736A (en) Image description method based on long-short term memory model and target detection algorithm
CN110347857B (en) Semantic annotation method of remote sensing image based on reinforcement learning
CN111949824A (en) Visual question answering method and system based on semantic alignment and storage medium
CN111553350A (en) Attention mechanism text recognition method based on deep learning
CN115422939B (en) Fine granularity commodity named entity identification method based on big data
CN113836281A (en) Entity relation joint extraction method based on automatic question answering
CN111368197A (en) Deep learning-based comment recommendation system and method
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN112699685A (en) Named entity recognition method based on label-guided word fusion
Wang et al. A text-guided generation and refinement model for image captioning
CN114049501A (en) Image description generation method, system, medium and device fusing cluster search
CN113705222A (en) Slot recognition model training method and device and slot filling method and device
CN117290515A (en) Training method of text annotation model, method and device for generating text graph
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant