CN113283238B

CN113283238B - Text data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113283238B
Application number: CN202110547645.7A
Authority: CN
Inventors: 杨康; 徐凯波; 孙泽懿; 徐成国; 王硕
Original assignee: Shanghai Minglue Artificial Intelligence Group Co Ltd
Current assignee: Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date: 2021-05-19
Filing date: 2021-05-19
Publication date: 2023-12-22
Anticipated expiration: 2041-05-19
Also published as: CN113283238A

Abstract

The application provides a text data processing method and device, a storage medium and an electronic device. The method comprises: obtaining chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account; extracting entities and relation words between the entities from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words; classifying the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold; and encoding each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is the style, among a plurality of preset styles, that matches the target account. The method and the device solve the problem in the related art that manually compiling work summary text data takes a long time.

Description

Text data processing method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of machine learning, and in particular, to a method and apparatus for processing text data, an electronic device, and a storage medium.

Background

With the development of the internet, social networking tools (such as WeChat and Enterprise WeChat) have become increasingly common in people's daily life and work and have brought great convenience. At the same time, as large amounts of information circulate, filtering and organizing that information effectively has become a topic of wide concern. In daily work in particular, a person inevitably joins many group chats or communicates with many related people, and therefore receives a large amount of work-related information every day. Much of this work also requires a work summary to be compiled. Filtering and summarizing a large amount of chat data and then organizing it into a work summary is time-consuming, and many key milestones and details are easily omitted.

Therefore, the related art has the problems that manually compiling work summary text data takes a long time and that work content is frequently omitted.

Disclosure of Invention

The application provides a text data processing method and device, a storage medium and electronic equipment, so as to at least solve the problems in the related art that manually compiling work summary text data takes a long time and that work content is frequently omitted.

According to an aspect of an embodiment of the present application, there is provided a method of text data processing, the method including: obtaining chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software; extracting entities and relation words between the entities from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words; classifying the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between phrases in a target phrase set is greater than a preset threshold; and encoding each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is the style, among a plurality of preset styles, that matches the target account.

According to another aspect of the embodiments of the present application, there is also provided an apparatus for processing text data, including: an acquisition unit, configured to acquire chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software; an extraction unit, configured to extract entities and relation words between the entities from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words; a classification unit, configured to classify the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold; and an encoding unit, configured to encode each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is the style, among a plurality of preset styles, that matches the target account.

Optionally, the classification unit includes: the acquisition module is used for acquiring the time information corresponding to the chat record; the first determining module is used for determining a preset step length for dividing the time information, wherein the preset step length is a fixed value; the first dividing module is used for dividing the time information by utilizing the preset step length to obtain a plurality of target phrase sets.

Optionally, the acquiring module includes: an acquisition subunit, configured to acquire the number information of the chat records; the calculating subunit is used for carrying out average calculation on the quantity information to obtain average value information; and the setting subunit is used for taking the mean value information as the preset step length.

Optionally, the classification unit includes: the sequencing module is used for sequencing the time information according to the time sequence to obtain a sequencing result; and the second dividing module is used for dividing a first chat record with the time difference between two adjacent time information in the sequencing result being smaller than or equal to a preset difference value into a first target phrase set, and dividing a second chat record except the first chat record into a second target phrase set, wherein the first target phrase set and the second target phrase set are subsets of the target phrase set.

Optionally, the apparatus further comprises: a first dividing unit, configured to divide the second chat record into the first target phrase set if it is determined that the degree of association between the entity in the second chat record and the entity in the first chat record is greater than or equal to the preset threshold; and the second dividing unit is used for dividing the second chat record into the second target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is smaller than the preset threshold value.

Optionally, the classification unit includes: the matching module is used for matching the entity in the keyword group with a preset item byte by utilizing a byte matching scheme, wherein the preset item byte is used for indicating an item to which the entity belongs; the attribution module is used for attributing the entity to a target item corresponding to the target item byte to obtain the target phrase set under the condition that a matching result between the target item byte and the entity is larger than a preset matching threshold value in the preset item byte, wherein the entity in one target phrase set is attributed to the same item, and the target item byte is any item byte in the preset item bytes.

Optionally, the classification unit further comprises: the second determining module is used for determining that a working relationship exists between the first user and the second user in the target account according to the chat record; the extraction module is used for extracting working keywords from the working relations, wherein the working keywords are used for representing the working relations among users; and the classification module is used for classifying the plurality of key phrases by utilizing the working key words to obtain a plurality of classified target phrase sets.

Optionally, the encoding unit includes: the coding module is used for carrying out word vector coding on each phrase in the target phrase set to obtain coded data; the decoding module is used for decoding the coded data by utilizing a multi-task decoder to obtain text data meeting the target style, wherein the multi-task decoder is used for decoding the coded data according to the preset style, the number of the preset style is at least one, and the semantic meaning expressed by the text data is the same as the semantic meaning expressed by each phrase.

According to yet another aspect of the embodiments of the present application, there is also provided an electronic device including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory complete communication with each other through the communication bus; wherein the memory is used for storing a computer program; a processor for executing the method steps of text data processing in any of the above embodiments by running the computer program stored on the memory.

According to a further aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the method steps of the text data processing in any of the embodiments described above when run.

In the embodiment of the application, chat record data of work interaction software is acquired, processed and integrated. Chat records stored in the interaction software are obtained, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software; entities and relation words between the entities are extracted from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words; the plurality of key phrases are classified by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold; and each phrase in the target phrase set is encoded to obtain text data meeting a target style, wherein the target style is the style, among a plurality of preset styles, that matches the target account. In this way, useful data is screened out by collecting, filtering and organizing the data, the useful data is organized and classified in detail, and finally the classified target phrase sets are encoded to generate text data in the target style matching the target account.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic diagram of a hardware environment of an alternative method of text data processing according to an embodiment of the invention;

FIG. 2 is a flow diagram of an alternative method of text data processing according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an alternative textCNN architecture according to an embodiment of the present application;

FIG. 4 is an overall flow diagram of an alternative phrase extraction method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an alternative model for generating text data of a target style according to an embodiment of the present application;

FIG. 6 is a block diagram of an alternative text data processing apparatus according to an embodiment of the present application;

Fig. 7 is a block diagram of an alternative electronic device according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will be made in detail and with reference to the accompanying drawings in the embodiments of the present application, it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, shall fall within the scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to one aspect of an embodiment of the present application, a method of text data processing is provided. Alternatively, in the present embodiment, the above-described text data processing method may be applied to a hardware environment as shown in fig. 1. As shown in fig. 1, the terminal 102 may include a memory 104, a processor 106, and a display 108 (optional components). The terminal 102 may be communicatively coupled to a server 112 via a network 110, the server 112 being operable to provide services (e.g., gaming services, application services, etc.) to the terminal or to clients installed on the terminal, and a database 114 may be provided on the server 112 or independent of the server 112 for providing data storage services to the server 112. In addition, a processing engine 116 may be run in the server 112, which processing engine 116 may be used to perform the steps performed by the server 112.

Alternatively, the terminal 102 may be, but is not limited to, a terminal capable of processing data, such as a mobile terminal (e.g., a mobile phone or a tablet computer), a notebook computer, or a PC (Personal Computer). The network may include, but is not limited to, a wireless network or a wired network. The wireless network includes Bluetooth, WIFI (Wireless Fidelity) and other networks that enable wireless communication. The wired network may include, but is not limited to, a wide area network, a metropolitan area network and a local area network. The server 112 may include, but is not limited to, any hardware device that can perform computation.

In addition, in this embodiment, the method for processing text data may be applied to, but not limited to, a stand-alone processing device with a relatively high processing capability, without data interaction. For example, the processing device may be, but is not limited to, a more powerful terminal device, i.e. the individual operations of the above-described method of text data processing may be integrated in a separate processing device. The above is merely an example, and is not limited in any way in the present embodiment.

Alternatively, in this embodiment, the method of text data processing may be performed by the server 112, or may be performed by the terminal 102, or may be performed by both the server 112 and the terminal 102. The method for executing text data processing by the terminal 102 according to the embodiment of the present application may also be executed by a client installed thereon.

Taking a server as an example, fig. 2 is a schematic flow chart of an alternative text data processing method according to an embodiment of the present application, and as shown in fig. 2, the flow chart of the method may include the following steps:

Step S201, a chat record stored in the interaction software is obtained, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software.

Optionally, in the embodiment of the present application, the relevant chat record information of the user on the target account may be obtained by using the interaction software (such as some working communication software: xx), where the target account is an account registered by the user using the interaction software on the interaction software, and all information of the user performing communication interaction with other users by using the interaction software is stored in the target account.

In the embodiment of the application, after the chat record stored in the interaction software by the target account is obtained, data preprocessing is performed on the chat record. The data preprocessing module mainly filters the backend data of the work interaction software for a given user within a period of time, and cleans out chat content that is irrelevant to the progress of the work. A text classification algorithm is mainly used to classify each sentence, so that chat sentences whose content relates to work are found and used as the corpus for subsequent extraction.

The text classification algorithm may adopt a number of different model structures (e.g., TextCNN (Text Convolutional Neural Networks), Transformer, etc.). Because chat sentences are generally short and their sentence structure is not very complex, TextCNN may be used to classify the text and complete the data preprocessing. See Fig. 3 for details.

Fig. 3 is a schematic diagram of an alternative TextCNN structure according to an embodiment of the present application. In Fig. 3, InputLayer represents the input layer of the model, Embedding represents the word vector representation structure of the model, Conv1D represents a 1-dimensional CNN structure, MaxPooling1D represents a 1-dimensional max pooling layer, Concatenate represents splicing the hidden information output by the different branch structures, Flatten represents expanding the spliced two-dimensional vector into a 1-dimensional vector, Dense represents a fully connected structure, input represents the input, and output represents the output. Finally, the classification result is calculated by softmax.
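By way of illustration only, the forward pass through the branch structure described for Fig. 3 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the patented model: the weights are random rather than trained, the vocabulary size, vector dimension, kernel sizes and class count are hypothetical, and each branch pools over the whole sequence so that the flattening step is trivial.

```python
import math
import random

def conv1d_relu(seq, kernel, bias):
    """Valid 1-D convolution over a sequence of word vectors, followed by ReLU.

    seq is a list of T vectors of length D; kernel is a list of k vectors of
    length D. Returns T - k + 1 activations for a single filter (T must be >= k).
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * x for w, x in zip(kernel[offset], seq[start + offset]))
        out.append(max(0.0, total))
    return out

def softmax(logits):
    peak = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [v / norm for v in exps]

def textcnn_forward(token_ids, embedding, branches, dense_w, dense_b):
    """Embedding -> parallel Conv1D + max pooling -> concatenate -> dense -> softmax."""
    seq = [embedding[t] for t in token_ids]
    pooled = []
    for filters in branches:                # one branch per kernel size
        for kernel, bias in filters:
            pooled.append(max(conv1d_relu(seq, kernel, bias)))
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
VOCAB, DIM, FILTERS, CLASSES = 20, 4, 2, 2  # hypothetical sizes
KERNEL_SIZES = (2, 3)                       # hypothetical kernel sizes
embedding = random_matrix(rng, VOCAB, DIM)
branches = [[(random_matrix(rng, k, DIM), 0.0) for _ in range(FILTERS)]
            for k in KERNEL_SIZES]
dense_w = random_matrix(rng, CLASSES, FILTERS * len(KERNEL_SIZES))
dense_b = [0.0] * CLASSES

# Probabilities for "work-related" versus "not work-related" (untrained weights).
probs = textcnn_forward([3, 7, 1, 12, 5], embedding, branches, dense_w, dense_b)
```

In practice the same structure would be built and trained with a deep learning framework; the plain-Python form is used here only to make the data flow between the layers explicit.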

Step S202, extracting entities and relation words between the entities from the phrases in the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words.

Optionally, a target model based on an extraction algorithm (such as Bi-LSTM (Bi-directional Long Short-Term Memory) or BERT) is selected, and the entities and the relation words between the entities are extracted from the phrases in the chat records, yielding the entities and the relations between them and thus a plurality of key phrases, wherein each key phrase contains a plurality of entities and the relation words between those entities.

In this embodiment of the present application, key nouns, verbs and relation words are extracted by constructing a corresponding deep learning algorithm model and rules. Specifically, refer to the overall flow diagram of the phrase extraction method shown in Fig. 4, which is divided into two parts, a triplet product processing part and an extraction flow part:

The triplet product processing part: acquiring a triplet product, establishing project information according to the triplet product, constructing an ontology, importing a dictionary into the ontology, importing rules, performing manual annotation, and identifying entities.

The extraction flow part is divided into a training sub-part and a prediction sub-part. First, the ontology (namely the entities and the relations between entities) is determined according to the ontology construction requirements, and then the manually annotated data is put into the training data of the training sub-part to generate a training model.

Then, the training model is tuned and otherwise processed according to the dictionary (such as an NER (Named Entity Recognition) dictionary) and the rules (such as NER rules) imported by the triplet product processing part, so as to determine an output model (namely the algorithm model in the figure), and the determined algorithm model is used to extract the entities, the relation words between entities, and the like.
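By way of illustration only, the shape of a key phrase (entities plus the relation words between them) can be sketched with a plain dictionary lookup. The trained Bi-LSTM or BERT model is not reproduced here; the lookup is a deliberately simple stand-in, and the names ENTITY_DICT, RELATION_WORDS, KeyPhrase and extract_key_phrase, as well as the dictionary contents, are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical dictionary and relation-word list standing in for the imported
# NER dictionary and rules; a trained model would replace this lookup.
ENTITY_DICT = {"EV project", "test report", "Zhang San"}
RELATION_WORDS = {"submitted", "reviewed", "completed"}

@dataclass(frozen=True)
class KeyPhrase:
    entities: tuple
    relation_words: tuple

def extract_key_phrase(sentence):
    """Return the dictionary entities and relation words found in a sentence,
    ordered by position, or None when the sentence contains no known entity."""
    lowered = sentence.lower()
    entities = sorted((e for e in ENTITY_DICT if e.lower() in lowered),
                      key=lambda e: lowered.index(e.lower()))
    relations = sorted((r for r in RELATION_WORDS if r in lowered),
                       key=lowered.index)
    if not entities:
        return None
    return KeyPhrase(tuple(entities), tuple(relations))

phrase = extract_key_phrase("Zhang San submitted the test report for the EV project")
```

Sentences for which no entity is found return None, so they can be dropped before classification.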

Step S203, classifying the plurality of key phrases by using the target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in the target phrase sets is larger than a preset threshold.

Optionally, after a plurality of key phrases are obtained, they are grouped and organized according to the time of the chat records and to categories such as project group and project-associated personnel, forming classified target phrase sets. The phrases in a target phrase set can represent particular work keywords within a certain period of time. The association degree between the phrases contained in a target phrase set is greater than a preset threshold; that is, the phrases are associated closely enough to be classified into one category. The preset threshold is the minimum association degree required between phrases in the same target phrase set.

Step S204, each phrase in the target phrase set is encoded to obtain text data meeting a target style, wherein the target style is a style matched with the target account in a plurality of preset style styles.

Optionally, the text style of the target account is determined, and the server encodes each phrase in the target phrase set with an encoder to obtain text data meeting the target style corresponding to the target account. The target style is the style, among a plurality of preset styles, that matches the target account. The preset styles may be set in the server in advance by the user, and after encoding, text data meeting the user's required writing style can be decoded according to the user's needs.

It should be noted that the text data may be text data such as a work summary, and the embodiment of the present application does not limit the specific content of the text data. The preset styles may be a business style, a lovely style, a standard style, etc., and the target style may be any one of these three preset styles.

In the embodiment of the application, chat record data of work interaction software is acquired, processed and integrated. Chat records stored in the interaction software are obtained, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software; entities and relation words between the entities are extracted from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the relation words; the plurality of key phrases are classified by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold; and each phrase in the target phrase set is encoded to obtain text data meeting a target style, wherein the target style is the style, among a plurality of preset styles, that matches the target account. In this way, useful data is screened out by collecting, filtering and organizing the data, the useful data is organized and classified in detail, and finally the classified target phrase sets are encoded to generate text data in the target style matching the target account.
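By way of illustration only, the way steps S201 to S204 chain together can be sketched as a single function whose stages are supplied by the caller. The function name summarize_chat and all parameter names are hypothetical, and the lambdas in the usage lines are trivial placeholders rather than the models described in the embodiments.

```python
from typing import Callable, Iterable, List, Sequence

def summarize_chat(
    records: Iterable[str],
    is_work_related: Callable[[str], bool],
    extract: Callable[[str], object],
    classify: Callable[[List[object]], List[List[object]]],
    generate: Callable[[Sequence[object], str], str],
    target_style: str,
) -> List[str]:
    """Chain the steps of the method; every stage is injected by the caller."""
    # Preprocessing after S201: keep only sentences related to work.
    work_sentences = [r for r in records if is_work_related(r)]
    # S202: one key phrase per sentence; sentences yielding None are dropped.
    key_phrases = [p for p in map(extract, work_sentences) if p is not None]
    # S203: group the key phrases into target phrase sets.
    phrase_sets = classify(key_phrases)
    # S204: produce one piece of text per set in the target style.
    return [generate(phrase_set, target_style) for phrase_set in phrase_sets]

# Placeholder stages, used only to show the data flow.
summary = summarize_chat(
    ["deploy done", "lunch?", "deploy failed"],
    is_work_related=lambda s: "deploy" in s,
    extract=lambda s: s.split()[1],
    classify=lambda phrases: [phrases],
    generate=lambda phrase_set, style: f"[{style}] " + ", ".join(phrase_set),
    target_style="standard",
)
```

Passing the stages in as callables keeps the sketch independent of any particular classifier, extraction model or decoder.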

As an optional embodiment, classifying the plurality of keyword groups by using the target scheme, and obtaining the plurality of classified target phrase sets includes:

obtaining time information corresponding to the chat record;

determining a preset step length for dividing the time information, wherein the preset step length is a fixed value;

and dividing the time information by using a preset step length to obtain a plurality of target phrase sets.

Optionally, the chat records can be categorized according to preset step sizes set in advance, more specifically, the server obtains time information corresponding to the chat records, and the time information is fixedly divided by using the preset step sizes, so that a plurality of target phrase sets can be obtained.

For example, the preset step length is set to be 5 minutes, so that chat records obtained every five minutes are summarized into the same target phrase set.

According to the method and the device, the time information is divided by utilizing the preset step length, so that each target phrase set is obtained, and extraction of entities or other information in the phrases is facilitated.
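By way of illustration only, division by a fixed preset step length can be sketched as follows. The sketch assumes that each record is a (timestamp, text) pair and that the windows are measured from the earliest timestamp; both choices, and the function name bucket_by_step, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def bucket_by_step(records, step):
    """Group (timestamp, text) pairs into consecutive windows of fixed length,
    measured from the earliest timestamp. Empty windows are skipped."""
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r[0])
    origin = ordered[0][0]
    buckets = defaultdict(list)
    for ts, text in ordered:
        # Integer division of two timedeltas gives the window index.
        buckets[(ts - origin) // step].append(text)
    return [buckets[i] for i in sorted(buckets)]

day = datetime(2021, 5, 19)
chat = [
    (day.replace(hour=8, minute=0), "a"),
    (day.replace(hour=8, minute=3), "b"),
    (day.replace(hour=8, minute=6), "c"),
    (day.replace(hour=8, minute=11), "d"),
]
groups = bucket_by_step(chat, timedelta(minutes=5))
```

With a five-minute step, the records at 8:00 and 8:03 share the first window, while 8:06 and 8:11 each fall into a later window.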

As an alternative embodiment, obtaining the time information corresponding to the chat record includes:

acquiring the quantity information of chat records;

Carrying out average calculation on the quantity information to obtain average information;

taking the mean value information as a preset step length.

Optionally, the embodiment of the present application may perform average calculation on the number of acquired chat records to obtain an average value, and use the average value as a preset step length to serve as a basis for dividing time information.

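For example, if 6 chat records are obtained and the average value is 3, the number 3 can be used as the preset step length to divide the time information, so that two target phrase sets are obtained.

As an optional embodiment, classifying the plurality of key phrases by using the target scheme to obtain the plurality of classified target phrase sets includes:

sequencing the time information in chronological order to obtain a sequencing result;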
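By way of illustration only, using a mean value as the step length can be sketched as follows. The text does not state over which quantities the average is taken, so this sketch assumes, purely hypothetically, that it is the mean number of records per earlier period, and that the ordered records are then split into groups of that size. The function names are likewise hypothetical.

```python
from statistics import mean

def step_from_counts(counts_per_period):
    """Use the rounded mean record count as the step length (at least 1)."""
    return max(1, round(mean(counts_per_period)))

def chunk(records, step):
    """Split an ordered list of records into consecutive groups of `step` items."""
    return [records[i:i + step] for i in range(0, len(records), step)]

records = ["r1", "r2", "r3", "r4", "r5", "r6"]
step = step_from_counts([2, 4])      # hypothetical per-period counts, mean 3
groups = chunk(records, step)
```

Under these assumptions, six records and a step length of 3 give two groups, in line with the example above.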

dividing a first chat record with the time difference between two adjacent time information in the sequencing result smaller than or equal to a preset difference value into a first target phrase set, and dividing a second chat record except the first chat record into a second target phrase set, wherein the first target phrase set and the second target phrase set are subsets of the target phrase set.

Optionally, the embodiment of the present application needs to perform time sequencing on the acquired time information to form a sequenced sequencing result.

The method comprises the steps of obtaining a preset difference value, dividing chat records by the preset difference value, further, obtaining time difference between two adjacent time information in a sequencing result, comparing the time difference with the preset difference value, dividing the chat records with the time difference smaller than or equal to the preset difference value into a first target phrase set, and dividing the chat records with the time difference larger than the preset difference value into a second target phrase set. The first target phrase set and the second target phrase set are subsets of the target phrase set.

For example, the sequencing result includes chat records with time information 8:00, 8:03, 8:10 and 8:20, where the time difference between 8:00 and 8:03 is 3, the time difference between 8:03 and 8:10 is 7, and the time difference between 8:10 and 8:20 is 10. If the preset difference is set to 5, the chat records at 8:00 and 8:03 are divided into the first target phrase set, and the remaining chat records at 8:10 and 8:20 are divided into the second target phrase set. Meanwhile, although the time difference between 8:03 and 8:10 is 7, the chat record at 8:03 has already been divided into the first target phrase set and stays there; that is, each chat record can be divided into only one target phrase set, and the division cannot be repeated.

In the embodiment of the application, the division result can be determined according to the comparison condition between the time difference between the two adjacent time information and the preset difference value, and the division of the keyword groups is facilitated.
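By way of illustration only, division by the time difference between adjacent records can be sketched as follows. The sketch follows one reading of the embodiment: every record that has at least one neighbour within the preset difference goes to the first set, and all remaining records go to the second set. The function name split_by_gap is hypothetical.

```python
from datetime import datetime, timedelta

def split_by_gap(timestamps, max_gap):
    """Return (first_set, second_set): records with an adjacent record no more
    than max_gap away, and all other records. Each record lands in one set only."""
    ordered = sorted(timestamps)
    close = set()
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier <= max_gap:
            close.update((earlier, later))
    first = [t for t in ordered if t in close]
    second = [t for t in ordered if t not in close]
    return first, second

day = datetime(2021, 5, 19)
times = [day.replace(hour=8, minute=m) for m in (0, 3, 10, 20)]
first_set, second_set = split_by_gap(times, timedelta(minutes=5))
```

For the records at 8:00, 8:03, 8:10 and 8:20 with a preset difference of 5 minutes, the first set holds 8:00 and 8:03 and the second set holds 8:10 and 8:20.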

As an optional embodiment, the method further includes, before dividing the first chat record in which the time difference between two adjacent pieces of time information in the ranking result is less than or equal to the preset difference value into the first target phrase set and dividing the second chat record other than the first chat record into the second target phrase set:

dividing the second chat record into a first target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is larger than or equal to a preset threshold value;

and dividing the second chat record into a second target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is smaller than a preset threshold value.

Optionally, before the second chat record is divided according to the preset difference, it may be further determined whether the second chat record may be also divided into the first target phrase set according to a degree of association between the second chat record and the first chat record.

Further, the association degree between the entity in the second chat record and the entity in the first chat record may be determined. If the association degree is greater than or equal to the preset threshold, the second chat record is divided into the first target phrase set; otherwise, the second chat record is divided into the second target phrase set.

In the embodiment of the application, the division result can be determined according to the association relation among the entities, which is beneficial to improving the division accuracy of the keyword groups.
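The threshold decision above can be sketched as follows. The text does not define how the association degree is computed, so this sketch assumes, purely for illustration, a Jaccard overlap between the entity sets of the two chat records; the function names are hypothetical.

```python
def association_degree(entities_a, entities_b):
    """Jaccard overlap of two entity collections (an assumed measure)."""
    a, b = set(entities_a), set(entities_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def assign_second_record(first_entities, second_entities, preset_threshold):
    """Return "first" or "second": the set the second chat record joins."""
    degree = association_degree(first_entities, second_entities)
    # Greater than or equal to the threshold -> first target phrase set.
    return "first" if degree >= preset_threshold else "second"

print(assign_second_record({"EV project", "budget"},
                           {"EV project", "budget", "schedule"}, 0.5))
print(assign_second_record({"EV project"}, {"lunch"}, 0.5))
```

Any other similarity measure could be substituted for `association_degree` without changing the threshold logic.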

As an optional embodiment, classifying the plurality of keyword groups by using the target scheme, and obtaining the plurality of classified target phrase sets further includes:

matching the entity in the key phrase with a preset item byte by using a byte matching scheme, wherein the preset item byte is used for indicating an item to which the entity belongs;

when, among the preset item bytes, a matching result between a target item byte and the entity is greater than a preset matching threshold, attributing the entity to the target item corresponding to the target item byte to obtain a target phrase set, wherein the entities in one target phrase set belong to the same item, and the target item byte is any item byte among the preset item bytes.

Optionally, the embodiment of the present application may further classify the key phrases according to the item to which they belong. More specifically, a byte matching scheme may be used to match an entity in a key phrase with the preset item bytes, where the preset item bytes are usually keywords representing each item; for example, "EV construction" and "state" in "EV project construction state" are preset item bytes. According to the matching result, if a target item byte exists among the preset item bytes (for example, "EV construction") and the matching result between the target item byte and the entity is greater than the preset matching threshold (which may be 80%), the entity is attributed to the target item corresponding to the target item byte, here the EV project, so as to obtain the target phrase set.

Because the preset item bytes contain multiple item bytes, the entity is considered to belong to the target item as long as the matching result between any one target item byte and the entity in the key phrase is greater than the preset matching threshold; in this embodiment the target item is the EV project. It can also be understood that, because a target phrase set in the embodiment of the application is composed of entities belonging to the same target item, the entities in one target phrase set belong to the same item.

In the embodiment of the application, classification can be performed according to the item to which the entity belongs, so that a plurality of target phrase sets are obtained, and accurate classification of the subsequent key phrases is facilitated.
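The matching-and-attribution step can be sketched as follows. The text does not specify how the matching result is computed, so the character-overlap ratio in `match_ratio` is an assumption made for illustration; the function names and the sample item table are likewise hypothetical. The 0.8 default mirrors the 80% threshold mentioned above.

```python
from collections import Counter

def match_ratio(entity, item_key):
    """Share of the item key's characters that also occur in the entity
    (case-insensitive multiset overlap; an assumed matching measure)."""
    key = Counter(item_key.lower())
    ent = Counter(entity.lower())
    total = sum(key.values())
    if total == 0:
        return 0.0
    return sum((key & ent).values()) / total

def attribute_to_item(entity, preset_item_keys, matching_threshold=0.8):
    """Return the item whose key matches the entity above the threshold,
    or None when no preset item key matches well enough."""
    for item_key, item in preset_item_keys.items():
        if match_ratio(entity, item_key) > matching_threshold:
            return item
    return None

preset = {"EV construction": "EV project"}
print(attribute_to_item("EV construction plan", preset))
print(attribute_to_item("budget", preset))
```

Entities attributed to the same item can then be collected into one target phrase set.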

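As an optional embodiment, classifying the plurality of key phrases by using the target scheme to obtain the plurality of classified target phrase sets further includes:

determining that a working relationship exists between a first user and a second user in the target account according to the chat record;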

extracting working keywords from the working relations, wherein the working keywords are used for representing the working relations among users;

and classifying the plurality of key phrase groups by utilizing the working key words to obtain a plurality of classified target phrase groups.

Optionally, the embodiment of the present application may further classify by using the working relationship between users. First, it is determined from the chat record that a working relationship exists between a first user and a second user in the target account; for example, a context relationship between the first user and the second user is recorded in the chat record. Working keywords, such as "repair" and "electric appliance", are then extracted from the working relationship. From these keywords it can be inferred that the working association between the first user and the second user concerns electrician work. Only the phrases in the chat record that are related to the working keywords "repair" and "electric appliance" then need to be classified, yielding a plurality of target phrase sets classified according to the working keywords.

In the embodiment of the application, the categories of project association personnel can be generalized and arranged to obtain a plurality of target phrase sets, so that the accurate division of the subsequent key phrases is facilitated.
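The keyword-based classification can be sketched as follows. This is a minimal illustration that treats a phrase as "related to" a working keyword when the keyword occurs in it as a substring; that rule, the function name, and the sample phrases are assumptions, not details given in the text.

```python
def classify_by_working_keywords(key_phrases, working_keywords):
    """Group key phrases under each working keyword they contain.

    Phrases that contain none of the working keywords are left out,
    since only keyword-related phrases need to be classified.
    """
    groups = {keyword: [] for keyword in working_keywords}
    for phrase in key_phrases:
        for keyword in working_keywords:
            if keyword in phrase:
                groups[keyword].append(phrase)
    return groups

phrases = ["repair the socket", "electric appliance order", "lunch plan"]
groups = classify_by_working_keywords(phrases, ["repair", "electric appliance"])
print(groups)
```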

As an alternative embodiment, encoding each phrase in the target phrase set to obtain text data meeting the target style includes:

performing word vector coding on each phrase in the target phrase set to obtain coded data;

and decoding the encoded data by using a multi-task decoder to obtain text data meeting the target style, wherein the multi-task decoder is used for decoding the encoded data according to preset style types, the number of the preset style types is at least one, and the semantic meaning expressed by the text data is the same as the semantic meaning expressed by each phrase.

Optionally, as shown in fig. 5, a unified encoder (e.g., BiLSTM/Transformer) and a multi-task decoder (e.g., BiLSTM/Transformer + beam search/group/HMM) are included. Each phrase in the plurality of target phrase sets may be input into the encoder for word vector encoding to obtain encoded data.

The encoded data is then decoded by the multi-task decoder to obtain a plurality of pieces of decoded text data, each conforming to a preset style. Because there are multiple preset styles and correspondingly multiple task decoders, the obtained text data also comes in multiple styles, which include the daily writing style of the user of the target account. For example, the preset styles may include a business style, a lovely style, and a standard style, and the text data decoded by the multi-task decoder may correspond to these three styles; alternatively, only the one piece of text data that meets the target style of the target account may be obtained.

It will be appreciated that the semantics of the corresponding text data representation should be the same, with only differences in style, regardless of whether the text data derived by the decoder is business style, lovely style, or standard style.

In the embodiment of the application, diversified text data can be generated by word vector coding of each phrase and decoding processing of a multi-task decoder so as to obtain work summary conforming to own style, thus enhancing text diversity and improving user interest.
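The encode-once, decode-per-style structure can be sketched as follows. This is only a structural sketch, not a neural model: the `encode` function and the two template decoders are toy stand-ins for the BiLSTM/Transformer encoder and task decoders, and all names and style labels are hypothetical.

```python
def encode(phrases):
    """Stand-in for the shared encoder: one representation for all decoders."""
    return [phrase.strip().lower() for phrase in phrases]

def decode_business(encoded):
    # Toy "business style" decoder.
    return "Work summary: " + "; ".join(encoded) + "."

def decode_standard(encoded):
    # Toy "standard style" decoder.
    return ", ".join(encoded)

# One decoder per preset style (the style names here are illustrative).
DECODERS = {"business": decode_business, "standard": decode_standard}

def generate(phrases, target_style):
    encoded = encode(phrases)  # encode once, reuse for every style
    candidates = {style: decode(encoded) for style, decode in DECODERS.items()}
    return candidates[target_style]  # keep only the account's target style

print(generate(["EV project", "Budget approved"], "business"))
```

Every decoder receives the same encoded data, so the candidates differ only in style and carry the same content.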

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM (Read-Only Memory)/RAM (Random Access Memory), magnetic disk, optical disk), including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the embodiments of the present application.

According to another aspect of the embodiments of the present application, there is also provided a text data processing apparatus for implementing the above-mentioned text data processing method. Fig. 6 is a block diagram of an alternative text data processing apparatus according to an embodiment of the present application, as shown in fig. 6, the apparatus may include:

the obtaining unit 601 is configured to obtain a chat record stored in the interaction software, where the interaction software is configured to record communication information of a target account, and the target account is an account used in the interaction software;

the extracting unit 602 is connected to the obtaining unit 601, and is configured to extract entities and related words between the entities by using the target model to obtain a plurality of key phrases, where the key phrases include the entities and the related words;

the classifying unit 603 is connected to the extracting unit 602, and is configured to classify the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, where a degree of association between phrases in the target phrase sets is greater than a preset threshold;

the encoding unit 604 is connected to the classifying unit 603, and is configured to encode each phrase in the target phrase set to obtain text data satisfying a target style, where the target style is a style matched with the target account in multiple preset style styles.

It should be noted that the acquiring unit 601 in this embodiment may be used to perform the above-described step S201, the extracting unit 602 in this embodiment may be used to perform the above-described step S202, the classifying unit 603 in this embodiment may be used to perform the above-described step S203, and the encoding unit 604 in this embodiment may be used to perform the above-described step S204.

Through the above modules, chat record data from work-related interaction software is obtained, processed, and integrated: chat records stored in the interaction software are obtained, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software; entities and related words among the entities are extracted by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the related words; the plurality of key phrases are classified by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold; and each phrase in the target phrase set is encoded to obtain text data meeting a target style, wherein the target style is the style matched with the target account among a plurality of preset styles. In the present application, useful data is screened out through collection, filtering, and arrangement of the data, then finely arranged and classified, and finally the classified target phrase sets are encoded to generate text data in the target style that conforms to the target account.

As an alternative embodiment, the classification unit comprises: the acquisition module is used for acquiring time information corresponding to the chat record; the first determining module is used for determining a preset step length for dividing the time information, wherein the preset step length is a fixed value; the first dividing module is used for dividing the time information by utilizing a preset step length to obtain a plurality of target phrase sets.
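The fixed-step division performed by the first dividing module can be sketched as follows. This is a minimal illustration under assumptions: time information is represented as whole minutes since some start point, and the function name `divide_by_step` is hypothetical.

```python
from collections import defaultdict

def divide_by_step(minutes, preset_step):
    """Bucket time values (in minutes) into windows of a fixed preset step.

    Returns the non-empty windows in chronological order; each window
    corresponds to one target phrase set.
    """
    if preset_step <= 0:
        raise ValueError("preset_step must be positive")
    buckets = defaultdict(list)
    for value in sorted(minutes):
        buckets[value // preset_step].append(value)
    return [buckets[index] for index in sorted(buckets)]

print(divide_by_step([20, 0, 10, 3], 10))
```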

As an alternative embodiment, the obtaining module includes: an acquisition subunit, configured to acquire the number information of the chat records; the calculating subunit is used for carrying out average calculation on the quantity information to obtain average information; and the setting subunit is used for taking the mean value information as a preset step length.

As an alternative embodiment, the classification unit comprises: the sequencing module is used for sequencing the time information according to the time sequence to obtain a sequencing result; the second dividing module is used for dividing a first chat record with the time difference between two adjacent time information in the sequencing result being smaller than or equal to a preset difference value into a first target phrase set, and dividing a second chat record except the first chat record into a second target phrase set, wherein the first target phrase set and the second target phrase set are subsets of the target phrase set.

As an alternative embodiment, the apparatus further comprises: the first dividing unit is used for dividing the second chat record into the first target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is larger than or equal to a preset threshold value; and the second dividing unit is used for dividing the second chat record into a second target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is smaller than a preset threshold value.

As an alternative embodiment, the classification unit comprises: the matching module is used for matching the entity in the key word group with a preset item byte by utilizing a byte matching scheme, wherein the preset item byte is used for indicating an item to which the entity belongs; the attribution module is used for attributing the entity to a target item corresponding to the target item byte to obtain a target phrase set under the condition that a matching result between the target item byte and the entity is larger than a preset matching threshold value in the preset item byte, wherein the entity in one target phrase set is attributed to the same item, and the target item byte is any item byte in the preset item byte.

As an alternative embodiment, the classification unit further comprises: the second determining module is used for determining that a working relationship exists between the first user and the second user in the target account according to the chat record; the extraction module is used for extracting working keywords from the working relations, wherein the working keywords are used for representing the working relations among users; and the classification module is used for classifying the plurality of key word groups by utilizing the working key words to obtain a plurality of classified target word group sets.

As an alternative embodiment, the encoding unit comprises: the coding module is used for carrying out word vector coding on each phrase in the target phrase set to obtain coded data; the decoding module is used for decoding the coded data by utilizing the multi-task decoder to obtain text data meeting the target style, wherein the multi-task decoder is used for decoding the coded data according to the preset style, the number of the preset style is at least one, and the semantic meaning expressed by the text data is the same as the semantic meaning expressed by each phrase.

It should be noted that the examples and application scenarios implemented by the above modules are the same as those of the corresponding steps, but are not limited to what is disclosed in the above embodiments. It should also be noted that the above modules may be implemented in software or in hardware as part of the apparatus shown in fig. 1, where the hardware environment includes a network environment.

According to still another aspect of the embodiments of the present application, there is also provided an electronic device for implementing the above-mentioned text data processing method, where the electronic device may be a server, a terminal, or a combination thereof.

Fig. 7 is a block diagram of an alternative electronic device, according to an embodiment of the present application, including a processor 701, a communication interface 702, a memory 703, and a communication bus 704, as shown in fig. 7, wherein the processor 701, the communication interface 702, and the memory 703 communicate with each other via the communication bus 704,

a memory 703 for storing a computer program;

the processor 701 is configured to execute the computer program stored in the memory 703, and implement the following steps:

S1, obtaining chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software;

S2, extracting entities and related words among the entities from the chat records by using a target model to obtain a plurality of key phrases, wherein the key phrases comprise the entities and the related words;

S3, classifying the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in a target phrase set is greater than a preset threshold;

S4, encoding each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is the style matched with the target account among a plurality of preset styles.
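The four steps executed by the processor can be chained as in the sketch below. It shows only the data flow between steps; the function `process_text_data` and its callable parameters are hypothetical, and the lambdas are toy stand-ins for the target model, the target scheme, and the encoder/decoder.

```python
def process_text_data(chat_records, extract, classify, encode_set):
    """Chain the steps: records -> key phrases -> phrase sets -> text data.

    chat_records is assumed to have been obtained already (step 1);
    extract, classify and encode_set stand in for steps 2 to 4.
    """
    key_phrases = extract(chat_records)
    phrase_sets = classify(key_phrases)
    return [encode_set(phrase_set) for phrase_set in phrase_sets]

# Toy stand-ins, for illustration only.
summaries = process_text_data(
    ["EV project delayed", "budget approved"],
    extract=lambda records: [record.split()[0] for record in records],
    classify=lambda phrases: [phrases],
    encode_set=lambda phrase_set: " / ".join(phrase_set),
)
print(summaries)
```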

Alternatively, in the present embodiment, the above-described communication bus may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, only one thick line is shown in fig. 7, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic device and other devices.

The memory may include RAM or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

As an example, as shown in fig. 7, the memory 703 may include, but is not limited to, the acquisition unit 601, the extraction unit 602, the classification unit 603, and the encoding unit 604 of the text data processing apparatus. Other module units of the text data processing apparatus may also be included, and are not described in detail in this example.

The processor may be a general purpose processor, including but not limited to a CPU (Central Processing Unit), an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.

In addition, the electronic device further includes: and the display is used for displaying the text data processing result.

Alternatively, specific examples in this embodiment may refer to examples described in the foregoing embodiments, and this embodiment is not described herein.

It will be understood by those skilled in the art that the structure shown in fig. 7 is only illustrative, and the device implementing the method for processing text data may be a terminal device, and the terminal device may be a smart phone (such as an Android mobile phone, an iOS mobile phone, etc.), a tablet computer, a palmtop computer, a mobile internet device (Mobile Internet Devices, MID), a PAD, etc. Fig. 7 is not limited to the structure of the electronic device described above. For example, the terminal device may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in fig. 7, or have a different configuration than shown in fig. 7.

Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program for instructing a terminal device to execute in association with hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, ROM, RAM, magnetic or optical disk, etc.

According to yet another aspect of embodiments of the present application, there is also provided a storage medium. Alternatively, in the present embodiment, the above-described storage medium may be used for program codes of a method of performing text data processing.

Alternatively, in this embodiment, the storage medium may be located on at least one network device of the plurality of network devices in the network shown in the above embodiment.

Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of:

Alternatively, specific examples in the present embodiment may refer to examples described in the above embodiments, which are not described in detail in the present embodiment.

Alternatively, in the present embodiment, the storage medium may include, but is not limited to: various media capable of storing program codes, such as a U disk, ROM, RAM, a mobile hard disk, a magnetic disk or an optical disk.

According to yet another aspect of embodiments of the present application, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium; the processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method steps of text data processing in any of the embodiments described above.

The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.

The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions to cause one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method for text data processing of the various embodiments of the present application.

In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a portion not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and are merely a logical functional division, and there may be other manners of dividing the apparatus in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution provided in the present embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application and are intended to be comprehended within the scope of the present application.

Claims

1. A method of text data processing, the method comprising:

obtaining chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software;

Extracting entities and related words among the entities by utilizing a target model to obtain a plurality of key word groups, wherein the key word groups comprise the entities and the related words;

classifying the plurality of key phrases by using a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between phrases in the target phrase sets is larger than a preset threshold; the step of classifying the plurality of key phrase groups by using the target scheme to obtain a plurality of classified target phrase sets comprises the following steps: obtaining time information corresponding to the chat record; determining a preset step length for dividing the time information, wherein the preset step length is a fixed value; dividing the time information by using the preset step length to obtain a plurality of target phrase sets, wherein the method comprises the following steps: sequencing the time information according to the time sequence to obtain a sequencing result; dividing the second chat record into the first target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is larger than or equal to the preset threshold value; dividing the second chat record into the second target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is smaller than the preset threshold value; dividing a first chat record with the time difference between two adjacent time information in the sequencing result being smaller than or equal to a preset difference value into a first target phrase set, and dividing a second chat record except the first chat record into a second target phrase set, wherein the first target phrase set and the second target phrase set are subsets of the target phrase set;

And encoding each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is a style matched with the target account in a plurality of preset style styles.

2. The method of claim 1, wherein the obtaining the time information corresponding to the chat record comprises:

acquiring the quantity information of the chat records;

and taking the mean value information as the preset step length.

3. The method of claim 1, wherein classifying the plurality of keyword groups using the target scheme to obtain a plurality of classified target phrase sets further comprises:

and under the condition that a matching result between a target item byte and the entity in the preset item bytes is larger than a preset matching threshold, attributing the entity to a target item corresponding to the target item byte to obtain the target phrase set, wherein the entity in one target phrase set is attributed to the same item, and the target item byte is any item byte in the preset item bytes.

4. The method of claim 1, wherein classifying the plurality of keyword groups using the target scheme to obtain a plurality of classified target phrase sets further comprises:

extracting working keywords from the working relationship, wherein the working keywords are used for representing the working relationship among users;

and classifying the plurality of key phrases by using the working key words to obtain a plurality of classified target phrase sets.

5. The method according to any one of claims 1 to 4, wherein encoding each phrase in the set of target phrases to obtain text data satisfying a target style comprises:

and decoding the encoded data by using a multi-task decoder to obtain text data meeting the target style, wherein the multi-task decoder is used for decoding the encoded data according to the preset style, the number of the preset style is at least one, and the semantic meaning expressed by the text data is the same as the semantic meaning expressed by each phrase.

6. An apparatus for text data processing, the apparatus comprising:

an acquisition unit, configured to obtain chat records stored in interaction software, wherein the interaction software is used for recording communication information of a target account, and the target account is an account used in the interaction software;

the extraction unit is used for extracting entities and related words among the entities by utilizing a target model to obtain a plurality of key word groups, wherein the key word groups comprise the entities and the related words;

the classification unit is used for classifying the plurality of key phrases by utilizing a target scheme to obtain a plurality of classified target phrase sets, wherein the association degree between the phrases in the target phrase sets is larger than a preset threshold value; the step of classifying the plurality of key phrase groups by using the target scheme to obtain a plurality of classified target phrase sets comprises the following steps: obtaining time information corresponding to the chat record; determining a preset step length for dividing the time information, wherein the preset step length is a fixed value; dividing the time information by using the preset step length to obtain a plurality of target phrase sets, wherein the method comprises the following steps: sequencing the time information according to the time sequence to obtain a sequencing result; dividing the second chat record into the first target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is larger than or equal to the preset threshold value; dividing the second chat record into the second target phrase set under the condition that the association degree between the entity in the second chat record and the entity in the first chat record is smaller than the preset threshold value; dividing a first chat record with the time difference between two adjacent time information in the sequencing result being smaller than or equal to a preset difference value into a first target phrase set, and dividing a second chat record except the first chat record into a second target phrase set, wherein the first target phrase set and the second target phrase set are subsets of the target phrase set;

And the coding unit is used for coding each phrase in the target phrase set to obtain text data meeting a target style, wherein the target style is a style matched with the target account in a plurality of preset style styles.

7. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus, characterized in that,

the memory is used for storing a computer program;

the processor is configured to perform the method steps of text data processing according to any one of claims 1 to 5 by running the computer program stored on the memory.

8. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to perform the method steps of text data processing as claimed in any of claims 1 to 5 when run.