CN112199578B - Information processing method and apparatus, electronic device, and storage medium - Google Patents

Info

Publication number
CN112199578B
Authority
CN
China
Prior art keywords
information
comment information
preset
comment
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010886868.1A
Other languages
Chinese (zh)
Other versions
CN112199578A (en
Inventor
江霜艳
李东升
崔鸣
刘娜
陈开江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Seashell Housing Beijing Technology Co Ltd
Original Assignee
贝壳找房(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 贝壳找房(北京)科技有限公司
Priority to CN202010886868.1A
Publication of CN112199578A
Application granted
Publication of CN112199578B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure provide an information processing method and apparatus, an electronic device, and a storage medium. The method includes: receiving an audit request message that contains comment information and basic information of a target item; performing at least one preset pattern matching on the comment information based on the basic information; in response to the comment information passing the at least one preset pattern matching, obtaining, through a machine learning model and based on at least one feature extracted from the comment information, a probability value that the comment information is publishable information; and determining, based on the probability value, whether the comment information is publishable information. The embodiments make it possible to audit item comment information and to avoid comments that are inconsistent with the item's actual condition, erroneous, or invalid.

Description

Information processing method and apparatus, electronic device, and storage medium
Technical Field
The present disclosure relates to data processing technologies, and in particular, to an information processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of Internet technology and the spread of mobile terminals, more and more Internet service providers offer sales and transactions of items (such as commodities, products, and services) online through browsers, applications (APPs), and the like, so users can learn about and purchase items of interest without leaving home. To help users fully understand an online item, professional service personnel or other users usually post comments on it.
However, in the course of implementing the present invention, the inventors found that comment information published by service personnel or users often does not match the actual information of the item: it may contain obviously wrong information or information unrelated to the item, so that other users cannot learn the item's objective condition and may even be misled.
For example, in the housing rental and sales industry, listings are mostly displayed through browsers and APPs, with text, pictures, audio, and video introducing the basic condition of each house. For a displayed listing, a broker usually publishes listing comment information that describes its characteristics, so that users can objectively and fully understand its advantages and disadvantages. However, some listing comments are inconsistent with the basic information of the listing and are not authentic. As a result, what a house seeker (a tenant or a buyer) finds when viewing the house offline differs from the comment, the transaction fails, and both the brokerage company and the house seeker waste considerable manpower and time.
Disclosure of Invention
The embodiments of the present disclosure provide an information processing method and apparatus, an electronic device, and a storage medium, in order to audit item comment information and avoid comments that are inconsistent with the item's actual condition, erroneous, or invalid.
In an aspect of the disclosed embodiments, an information processing method is provided, including:
receiving an audit request message, wherein the audit request message includes comment information and basic information of a target item;
performing at least one preset pattern matching on the comment information based on the basic information;
in response to the comment information passing the at least one preset pattern matching, obtaining, through a machine learning model and based on at least one feature extracted from the comment information, a probability value that the comment information is publishable information;
determining, based on the probability value, whether the comment information is publishable information.
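The two-stage flow above (rule-based pattern matching first, model scoring second) can be sketched as follows. This is only an illustrative sketch: the names `AuditRequest`, `audit`, `extract_features`, and `predict_proba`, and the 0.5 decision threshold, are assumptions for the example and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class AuditRequest:
    """Hypothetical container for an audit request message."""
    comment: str                 # comment information
    basic_info: Dict[str, str]   # basic information of the target item

# A pattern check returns True when the comment passes that preset pattern.
PatternCheck = Callable[[str, Dict[str, str]], bool]

def audit(
    request: AuditRequest,
    checks: Sequence[PatternCheck],
    extract_features: Callable[[str], List[float]],
    predict_proba: Callable[[List[float]], float],
    threshold: float = 0.5,      # assumed cut-off, not taken from the source
) -> bool:
    """Return True if the comment is judged publishable."""
    # Stage 1: the comment must pass every preset pattern.
    if not all(check(request.comment, request.basic_info) for check in checks):
        return False
    # Stage 2: a model turns extracted features into a probability value.
    probability = predict_proba(extract_features(request.comment))
    return probability >= threshold
```

Any callable that returns a probability can stand in for the model here, which keeps the rule stage and the model stage independently replaceable.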
Optionally, in any of the method embodiments of the present disclosure above, the method further includes:
in response to the comment information failing to match any one or more of the at least one preset pattern, outputting an audit result notification message indicating that the comment information failed the audit, wherein the notification message includes the information in the comment information that does not match the at least one preset pattern.
Optionally, in any of the method embodiments of the present disclosure, before receiving the audit request message, the method further includes:
in response to detecting a piece of newly added comment information, taking the item to which the newly added comment information relates as the target item, taking the newly added comment information as the comment information of the target item, acquiring the basic information of the target item from an item database, and generating and sending the audit request message; or,
according to a preset audit period, taking each item in turn as the target item, selecting in sequence each piece of comment information newly added for the target item within the audit period as the comment information of the target item, acquiring the basic information of the target item from an item database, and generating and sending the audit request message.
Optionally, in any of the above method embodiments of the present disclosure, the at least one preset pattern includes any one or more of: a pattern of consistency between the comment information and the basic information, a pattern of no restricted keywords, a pattern of conformity with a preset logical combination, and a pattern of no repeated description information;
the performing preset pattern matching on the comment information based on the basic information includes any one or more of the following:
matching the comment information against the basic information to determine whether the comment information contains information inconsistent with the basic information; if it does not, the comment information passes the consistency pattern;
identifying whether the comment information includes a keyword in a first preset keyword table, the first preset keyword table including at least one prohibited keyword; if the comment information includes no keyword in the first preset keyword table, the comment information passes the no-restricted-keyword pattern;
identifying whether the comment information contains description information that does not conform to a preset logical combination pattern, the preset logical combination pattern including two or more sub-patterns that are logically related; if there is no such description information, the comment information passes the logical-combination pattern;
identifying whether the comment information contains repeated description information for different dimensions; if it does not, the comment information passes the no-repeated-description pattern;
the comment information passing the at least one preset pattern matching means that the comment information passes every preset pattern in the at least one preset pattern.
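As a rough sketch of how such checks could be run together, the example below has each check return the offending fragments it finds, so that a failed audit can report what did not match. The function names, the dictionary of named checks, and the sample keyword table are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Each check returns the fragments of the comment that violate its pattern;
# an empty list means the comment passes that pattern.
Check = Callable[[str, Dict[str, str]], List[str]]

def banned_keyword_check(banned: List[str]) -> Check:
    """Build a check against a (hypothetical) first preset keyword table."""
    def check(comment: str, basic_info: Dict[str, str]) -> List[str]:
        return [kw for kw in banned if kw in comment]
    return check

def run_checks(
    comment: str,
    basic_info: Dict[str, str],
    checks: Dict[str, Check],
) -> Tuple[bool, Dict[str, List[str]]]:
    """Run every preset pattern; pass only if all of them pass."""
    failures: Dict[str, List[str]] = {}
    for name, check in checks.items():
        offending = check(comment, basic_info)
        if offending:
            failures[name] = offending
    return (not failures), failures
```

Returning the failures alongside the verdict is one way to supply the unmatched information that a failed-audit notification is meant to carry.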
Optionally, in any one of the method embodiments of the present disclosure, the matching the comment information with the basic information to determine whether there is information inconsistent with the basic information in the comment information includes:
extracting attribute information of each dimension of the target item from the basic information, and extracting description information of at least one dimension of the target item from the comment information; the basic information of the target item comprises attribute information of each dimension of the target item, and the comment information of the target item comprises description information of at least one dimension of the target item;
and comparing whether the description information of at least one dimension has information inconsistent with the attribute information of the corresponding dimension in the basic information.
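A minimal sketch of this dimension-by-dimension comparison is shown below. It assumes the comment has already been parsed into a mapping from dimension name to the described value; that parsing step, the function name `find_inconsistencies`, and the case-insensitive string comparison are assumptions made for illustration.

```python
from typing import Dict, Tuple

def find_inconsistencies(
    attributes: Dict[str, str],   # attribute information from the basic information
    described: Dict[str, str],    # description information parsed from the comment
) -> Dict[str, Tuple[str, str]]:
    """Return {dimension: (described value, actual value)} for every mismatch."""
    mismatches: Dict[str, Tuple[str, str]] = {}
    for dimension, claimed in described.items():
        actual = attributes.get(dimension)
        # Dimensions missing from the basic information cannot be compared.
        if actual is None:
            continue
        if claimed.strip().lower() != actual.strip().lower():
            mismatches[dimension] = (claimed, actual)
    return mismatches
```

An empty result means the comment passes the consistency pattern for the dimensions it describes.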
Optionally, in any one of the method embodiments of the present disclosure, the identifying whether there is description information that does not conform to a preset logical combination mode in the comment information includes:
extracting description information of at least one dimension for the target item from the comment information;
and identifying whether the description information of the at least one dimension has description information which does not accord with a preset logic combination mode.
Optionally, in any one of the method embodiments of the present disclosure, the identifying whether there is repeated description information for different dimensions in the comment information includes:
extracting description information of each dimension of the target item from the comment information, calculating a hash value of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; wherein repeated description information for different dimensions exists in the comment information when the hash values of the description information of different dimensions are the same;
or,
extracting description information of each dimension of the target item from the comment information; segmenting the description information of each dimension in a preset manner to obtain a plurality of segments for each dimension; calculating hash values of the segments of each dimension, and calculating the Jaccard coefficient between the segment hash values of any two different dimensions; comparing whether the Jaccard coefficient between different dimensions is greater than a preset similarity threshold; wherein repeated description information for different dimensions exists in the comment information when the Jaccard coefficient between different dimensions is greater than the preset similarity threshold;
or,
extracting description information of each dimension of the target item from the comment information, calculating a hash value of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; if no two dimensions have the same hash value, segmenting the description information of each dimension in a preset manner to obtain a plurality of segments for each dimension; calculating hash values of the segments of each dimension, and calculating the Jaccard coefficient between the segment hash values of any two different dimensions; comparing whether the Jaccard coefficient between different dimensions is greater than a preset similarity threshold; wherein repeated description information for different dimensions exists in the comment information when the hash values of the description information of different dimensions are the same, or when the Jaccard coefficient between different dimensions is greater than the preset similarity threshold.
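The combined variant (exact hash comparison first, then Jaccard similarity over segment hashes) might look like the sketch below. The source does not fix the hash function, the segmentation scheme, or the threshold, so SHA-256, character 3-grams, and a threshold of 0.8 are assumptions chosen only for the example.

```python
import hashlib
from itertools import combinations
from typing import Dict, Set

def _hash(text: str) -> str:
    # SHA-256 is an arbitrary choice; any stable hash would serve here.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def segment_hashes(text: str, size: int = 3) -> Set[str]:
    """Split text into overlapping character n-grams (assumed scheme) and hash each."""
    if len(text) <= size:
        return {_hash(text)}
    return {_hash(text[i:i + size]) for i in range(len(text) - size + 1)}

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def has_repeated_descriptions(
    descriptions: Dict[str, str],   # dimension name -> description text
    threshold: float = 0.8,         # assumed similarity threshold
) -> bool:
    for (_, text_a), (_, text_b) in combinations(descriptions.items(), 2):
        # Exact duplicates are caught cheaply by comparing whole-text hashes.
        if _hash(text_a) == _hash(text_b):
            return True
        # Near duplicates are caught by the Jaccard coefficient of segment hashes.
        if jaccard(segment_hashes(text_a), segment_hashes(text_b)) > threshold:
            return True
    return False
```

Checking whole-text hashes first avoids the more expensive segment comparison when two dimensions were filled with literally the same text.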
Optionally, in any one of the method embodiments of the present disclosure, the obtaining, by the machine learning model, a probability value of whether the comment information belongs to publishable information based on at least one feature extracted from the comment information includes:
extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature; the second preset keyword table comprises at least one keyword for limiting use;
extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature; the third preset keyword table comprises at least one error information keyword;
and obtaining a probability value of the comment information belonging to the publishable information based on the first characteristic and the second characteristic through the machine learning model.
Optionally, in any method embodiment of the present disclosure, the extracting, from the comment information, feature information of each keyword in a second preset keyword table to obtain a first feature includes:
acquiring keywords in a second preset keyword table included in the comment information to obtain first target keywords;
for each first target keyword, multiplying the term frequency (TF) of the first target keyword in the comment information by the inverse document frequency (IDF) of the first target keyword to obtain the term frequency-inverse document frequency (TF-IDF) value of the first target keyword; the first feature includes the TF-IDF value of the first target keyword;
and/or,
the extracting, from the comment information, feature information of each keyword in the third preset keyword table to obtain the second feature includes:
acquiring the keywords of the third preset keyword table that are included in the comment information to obtain second target keywords;
for each second target keyword, multiplying the TF of the second target keyword in the comment information by the IDF of the second target keyword to obtain the TF-IDF value of the second target keyword; the second feature includes the TF-IDF value of the second target keyword.
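The TF-IDF computation for keywords drawn from a preset keyword table can be sketched as below. The disclosure does not give the exact TF or IDF formula, so this example assumes TF is the keyword count divided by the comment length and IDF is log(N / (1 + df)) over a reference corpus of tokenized comments; the function and parameter names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

def tf_idf_features(
    comment_tokens: Sequence[str],        # tokenized comment information
    keyword_table: Sequence[str],         # e.g. the second or third preset keyword table
    corpus: Sequence[Sequence[str]],      # assumed reference corpus of tokenized comments
) -> Dict[str, float]:
    """Return {keyword: TF-IDF value} for table keywords present in the comment."""
    counts = Counter(comment_tokens)
    total = len(comment_tokens)
    features: Dict[str, float] = {}
    for keyword in keyword_table:
        if counts[keyword] == 0:
            continue  # only target keywords (those in the comment) are scored
        tf = counts[keyword] / total
        doc_freq = sum(1 for doc in corpus if keyword in doc)
        idf = math.log(len(corpus) / (1 + doc_freq))  # smoothed IDF (assumption)
        features[keyword] = tf * idf
    return features
```

Calling the function once per keyword table yields the first and second features separately, which can then be concatenated into the model input.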
Optionally, in any one of the method embodiments of the present disclosure, the obtaining, by the machine learning model, a probability value that the comment information belongs to publishable information based on at least one feature extracted from the comment information further includes:
extracting features of the description information of each dimension of the target item from the comment information respectively to obtain at least one dimension feature;
extracting the characteristics of the basic attribute information of the target project from the basic information to obtain project attribute characteristics;
the obtaining, by the machine learning model, a probability value that the comment information belongs to the issuable information based on the first feature and the second feature includes: obtaining, by the machine learning model, a probability value that the comment information belongs to the issuable information based on the first feature, the second feature, the at least one dimension feature, and the item attribute feature.
Optionally, in any one of the method embodiments of the present disclosure, the basic attribute information includes: city information and item type information; and/or,
the extracting features of the description information of each dimension for the target item from the comment information respectively to obtain at least one dimension feature includes:
respectively aiming at each dimension of the target item, extracting the characteristics of each word in the description information from the comment information;
respectively aiming at each dimension of the target item, acquiring preset high-frequency words matched with the features of each word in the description information as the features of the corresponding dimension based on the similarity between the features of each word in the description information and the features of each preset high-frequency word of the corresponding dimension; the at least one dimensional feature comprises: and the matched preset high-frequency words corresponding to all dimensions of the target item.
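One way to read this step is as a nearest-neighbour match between word vectors from the description and vectors of a dimension's preset high-frequency words. The sketch below takes that reading; the use of cosine similarity, the minimum similarity of 0.8, and the toy two-dimensional vectors in the usage note are assumptions, since the source does not say how word features or similarity are computed.

```python
import math
from typing import Dict, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def match_high_frequency_words(
    word_vectors: Dict[str, Sequence[float]],   # features of words in one dimension's description
    hf_vectors: Dict[str, Sequence[float]],     # features of that dimension's preset high-frequency words
    min_similarity: float = 0.8,                # assumed cut-off
) -> List[str]:
    """Return the preset high-frequency words matched by the description's words."""
    matched: List[str] = []
    for vector in word_vectors.values():
        best_word, best_sim = None, min_similarity
        for hf_word, hf_vector in hf_vectors.items():
            sim = cosine(vector, hf_vector)
            if sim >= best_sim:
                best_word, best_sim = hf_word, sim
        if best_word is not None and best_word not in matched:
            matched.append(best_word)
    return matched
```

For instance, with `{"roomy": [0.9, 0.1]}` as the description's word vectors and `{"spacious": [1.0, 0.0], "subway": [0.1, 0.9]}` as the preset words, only "spacious" clears the threshold and becomes the feature for that dimension.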
Optionally, in any of the above method embodiments of the present disclosure, the machine learning model includes a gradient boosting decision tree (GBDT) model.
In another aspect of the disclosed embodiments, there is provided an information processing apparatus including:
a receiving module, configured to receive an audit request message, wherein the audit request message includes comment information and basic information of a target item;
a pattern matching module, configured to perform at least one preset pattern matching on the comment information based on the basic information;
a machine learning model, configured to, in response to the comment information passing the at least one preset pattern matching, obtain a probability value that the comment information is publishable information based on at least one feature extracted from the comment information;
a determining module, configured to determine, based on the probability value, whether the comment information is publishable information.
Optionally, in any one of the apparatus embodiments of the present disclosure, the apparatus further includes:
an output module, configured to, in response to the comment information failing to match any one or more of the at least one preset pattern, output an audit result notification message indicating that the comment information failed the audit, wherein the notification message includes the information in the comment information that does not match the at least one preset pattern.
Optionally, in any one of the apparatus embodiments of the present disclosure, the apparatus further includes an acquisition module, configured to:
in response to detecting a piece of newly added comment information, take the item to which the newly added comment information relates as the target item, take the newly added comment information as the comment information of the target item, acquire the basic information of the target item from an item database, and generate and send the audit request message; or,
according to a preset audit period, take each item in turn as the target item, select in sequence each piece of comment information newly added for the target item within the audit period as the comment information of the target item, acquire the basic information of the target item from an item database, and generate and send the audit request message.
Optionally, in any one of the above apparatus embodiments of the present disclosure, the at least one preset pattern includes any one or more of: a pattern of consistency between the comment information and the basic information, a pattern of no restricted keywords, a pattern of conformity with a preset logical combination, and a pattern of no repeated description information;
the pattern matching module includes any one or more of the following units: a first auditing unit, a second auditing unit, a third auditing unit, and a fourth auditing unit; wherein:
the first auditing unit is configured to match the comment information against the basic information to determine whether the comment information contains information inconsistent with the basic information; if it does not, the comment information passes the consistency pattern;
the second auditing unit is configured to identify whether the comment information includes a keyword in a first preset keyword table, the first preset keyword table including at least one prohibited keyword; if the comment information includes no keyword in the first preset keyword table, the comment information passes the no-restricted-keyword pattern;
the third auditing unit is configured to identify whether the comment information contains description information that does not conform to a preset logical combination pattern, the preset logical combination pattern including two or more sub-patterns that are logically related; if there is no such description information, the comment information passes the logical-combination pattern;
the fourth auditing unit is configured to identify whether the comment information contains repeated description information for different dimensions; if it does not, the comment information passes the no-repeated-description pattern;
the comment information passing the at least one preset pattern matching means that the comment information passes every preset pattern in the at least one preset pattern.
Optionally, in an embodiment of any one of the apparatuses of the present disclosure, the first auditing unit is specifically configured to:
extracting attribute information of each dimension of the target item from the basic information, and extracting description information of at least one dimension of the target item from the comment information; the basic information of the target item comprises attribute information of each dimension of the target item, and the comment information of the target item comprises description information of at least one dimension of the target item;
and comparing whether the description information of at least one dimension has information inconsistent with the attribute information of the corresponding dimension in the basic information.
Optionally, in an embodiment of any one of the apparatuses of the present disclosure, the third auditing unit is specifically configured to:
extracting description information of at least one dimension for the target item from the comment information;
and identifying whether the description information of the at least one dimension has description information which does not accord with a preset logic combination mode.
Optionally, in an embodiment of any one of the apparatuses of the present disclosure, the fourth auditing unit is specifically configured to:
extracting description information of each dimension of the target item from the comment information, calculating a hash value of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; wherein repeated description information for different dimensions exists in the comment information when the hash values of the description information of different dimensions are the same;
or,
extracting description information of each dimension of the target item from the comment information; segmenting the description information of each dimension in a preset manner to obtain a plurality of segments for each dimension; calculating hash values of the segments of each dimension, and calculating the Jaccard coefficient between the segment hash values of any two different dimensions; comparing whether the Jaccard coefficient between different dimensions is greater than a preset similarity threshold; wherein repeated description information for different dimensions exists in the comment information when the Jaccard coefficient between different dimensions is greater than the preset similarity threshold;
or,
extracting description information of each dimension of the target item from the comment information, calculating a hash value of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; if no two dimensions have the same hash value, segmenting the description information of each dimension in a preset manner to obtain a plurality of segments for each dimension; calculating hash values of the segments of each dimension, and calculating the Jaccard coefficient between the segment hash values of any two different dimensions; comparing whether the Jaccard coefficient between different dimensions is greater than a preset similarity threshold; wherein repeated description information for different dimensions exists in the comment information when the hash values of the description information of different dimensions are the same, or when the Jaccard coefficient between different dimensions is greater than the preset similarity threshold.
Optionally, in any one of the apparatus embodiments of the present disclosure, the apparatus further includes:
the first feature extraction module is used for extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature; the second preset keyword table comprises at least one keyword for limiting use;
the second feature extraction module is used for extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature; the third preset keyword table comprises at least one error information keyword;
the machine learning model is specifically configured to obtain a probability value of the comment information belonging to the issuable information based on the first feature and the second feature.
Optionally, in any one of the above apparatus embodiments of the present disclosure, the first feature extraction module is specifically configured to: acquire the keywords of the second preset keyword table that are included in the comment information to obtain first target keywords; and, for each first target keyword, multiply the term frequency (TF) of the first target keyword in the comment information by the inverse document frequency (IDF) of the first target keyword to obtain the term frequency-inverse document frequency (TF-IDF) value of the first target keyword; the first feature includes the TF-IDF value of the first target keyword;
and/or,
the second feature extraction module is specifically configured to: acquire the keywords of the third preset keyword table that are included in the comment information to obtain second target keywords; and, for each second target keyword, multiply the TF of the second target keyword in the comment information by the IDF of the second target keyword to obtain the TF-IDF value of the second target keyword; the second feature includes the TF-IDF value of the second target keyword.
Optionally, in any one of the apparatus embodiments of the present disclosure, the apparatus further includes:
the third feature extraction module is used for extracting features of the description information of each dimension of the target item from the comment information respectively to obtain at least one dimension feature;
the fourth feature extraction module is used for extracting features of the basic attribute information of the target project from the basic information to obtain project attribute features;
the machine learning model is specifically configured to obtain a probability value that the comment information belongs to the issuable information based on the first feature, the second feature, the at least one dimension feature, and the item attribute feature.
Optionally, in any apparatus embodiment of the present disclosure above, the basic attribute information includes: city information and project type information; and/or,
the third feature extraction module is specifically configured to:
for each dimension of the target item, extract the features of each word in the description information from the comment information;
for each dimension of the target item, acquire, based on the similarity between the features of each word in the description information and the features of each preset high-frequency word of the corresponding dimension, the preset high-frequency words that match the features of the words in the description information as the feature of the corresponding dimension; the at least one dimension feature comprises: the matched preset high-frequency words corresponding to each dimension of the target item.
Optionally, in any of the above apparatus embodiments of the present disclosure, the machine learning model includes: a gradient boosting decision tree (GBDT) model.
In another aspect of the disclosed embodiments, an electronic device is provided, including:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory, and the computer program, when executed, implements the method of any of the above embodiments of the present disclosure.
In yet another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the method according to any of the above embodiments of the present disclosure.
Based on the information processing method and device, the electronic device, and the storage medium provided by the embodiments of the present disclosure, after receiving the audit request message, at least one preset pattern matching is performed on the comment information in the audit request message based on the basic information in the audit request message, if the comment information passes through the at least one preset pattern matching, a probability value that the comment information belongs to issuable information is obtained further through a machine learning model based on at least one feature extracted from the comment information, and then whether the comment information belongs to issuable information is determined based on the probability value. Therefore, the embodiment of the disclosure realizes the comprehensive examination and accurate judgment of the comment information of the target item in the aspects of pattern matching and semantics based on the basic information of the target item and by combining a pattern matching mode and a machine learning model, can effectively guarantee the objectivity, accuracy and high quality of the comment of the target item, and avoids the situations of inconsistency with the actual situation of the item, wrong information or invalid information.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of an information processing method according to the present disclosure.
Fig. 2 is a flowchart of another embodiment of the information processing method of the present disclosure.
Fig. 3 is a flowchart of another embodiment of the information processing method of the present disclosure.
Fig. 4 is a schematic structural diagram of an embodiment of an information processing apparatus according to the present disclosure.
Fig. 5 is a schematic structural diagram of another embodiment of the information processing apparatus according to the present disclosure.
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and are not intended to imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and means that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, servers, etc., which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Fig. 1 is a flowchart of an embodiment of an information processing method according to the present disclosure. As shown in fig. 1, the information processing method of this embodiment includes:
102, receiving an audit request message, wherein the audit request message comprises comment information and basic information of the target item.
The items in the embodiment of the present disclosure may be any goods, products, services, and the like, and the embodiment of the present disclosure does not limit the specific types of the items, and the target item is an item targeted by the current audit request message.
The basic information may include attribute information of each dimension of the item, where a dimension may be understood as an index of the item, and attribute information of one dimension may be understood as a parameter value of one index, for example, when the item is a house source, each dimension of the item may include: area, orientation, house type, unit price, etc., and the attribute information of the corresponding dimension may include: specific area size, specific orientation, specific house type, unit price value, etc., which are not limited by the embodiments of the present disclosure.
The comment information may be comment information issued by a service person or a user of the project according to characteristics of the project, and may include, but is not limited to: description information for each dimension of the item, advantages and disadvantages of the item, trading limitations of the item, policy information, and the like.
104, performing at least one preset pattern matching on the comment information based on the basic information.
Through operation 104, the comment information is matched against the at least one preset pattern to determine whether the comment information contains any content that fails to match any of the at least one preset pattern. If it does, the comment information does not pass the at least one preset pattern matching; otherwise, if the comment information matches all of the at least one preset pattern, the comment information passes the at least one preset pattern matching.
106, in response to the comment information passing the at least one preset pattern matching, obtaining, through a machine learning model, a probability value that the comment information belongs to publishable information based on at least one feature extracted from the comment information.
The machine learning model is a machine learning model trained in advance based on training samples, and can predict and output the probability value of the information belonging to the publishable information based on the input information. After the machine learning model is trained, when the auditing standard of the comment information changes, the machine learning model can be subjected to supplementary training based on a new training sample or a supplementary training sample which is consistent with the updated auditing standard so as to meet the requirement of the new auditing standard, and the probability value of the comment information belonging to the issuable information is predicted based on the new auditing standard.
108, determining whether the comment information belongs to publishable information based on the probability value.
According to the information processing method provided by the embodiment of the disclosure, after the audit request message is received, at least one preset pattern matching is performed on the comment information in the audit request message based on the basic information in the audit request message, if the comment information is matched through at least one preset pattern, a probability value that the comment information belongs to the issuable information is obtained through a machine learning model based on at least one feature extracted from the comment information, and then whether the comment information belongs to the issuable information is determined based on the probability value. Therefore, the embodiment of the disclosure realizes the comprehensive examination and accurate judgment of the comment information of the target item in the aspects of pattern matching and semantics based on the basic information of the target item and by combining a pattern matching mode and a machine learning model, can effectively guarantee the objectivity, accuracy and high quality of the comment of the target item, and avoids the situations of inconsistency with the actual situation of the item, wrong information or invalid information.
Fig. 2 is a flowchart of another embodiment of the information processing method of the present disclosure. As shown in fig. 2, the information processing method of this embodiment includes:
202, receiving an audit request message, wherein the audit request message includes comment information and basic information of the target item.
204, performing at least one preset pattern matching on the comment information based on the basic information.
If the comment information fails to match any one or more of the at least one preset pattern, operation 206 is performed. If the comment information passes the at least one preset pattern matching, operation 208 is performed.
206, outputting an audit result notification message indicating that the comment information fails the audit, where this notification message includes the information in the comment information that does not match the at least one preset pattern, and this unmatched information is the specific reason why the comment information fails the audit.
Optionally, in some possible implementations, after the audit result notification message indicating that the comment information fails the audit is output through operation 206, the message may be returned to the sender of the audit request message so that the sender does not publish the comment information, and may further be output to the user who submitted the comment information.
208, obtaining, through a machine learning model, a probability value that the comment information belongs to publishable information based on at least one feature extracted from the comment information.
210, determining whether the comment information belongs to publishable information based on the probability value.
Optionally, in some possible implementations, in operation 210, whether the comment information belongs to publishable information may be determined by comparing the probability value that the comment information belongs to publishable information with a preset threshold (e.g., 0.6): if the probability value is greater than the preset threshold, it is determined that the comment information belongs to publishable information; otherwise, it is determined that the comment information does not belong to publishable information.
Optionally, in some possible implementations, after determining whether the comment information belongs to the issuable information in operation 108 or 210, an audit result notification message indicating whether the comment information belongs to the issuable information may be returned to the sender of the audit request message, so that the sender issues the comment information when the comment information belongs to the issuable information, and does not issue the comment information when the comment information does not belong to the issuable information, and may further output an audit result notification message to a submitting user of the comment message.
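The two-stage flow of operations 202 to 210 can be sketched in Python as follows. This is only an illustrative outline under stated assumptions: each preset pattern is modeled as a callable that returns the unmatched content it finds, and the machine learning model is modeled as a callable that returns a probability. The names `audit`, `matchers`, and `predict_proba` are hypothetical, and the default threshold of 0.6 is the example value mentioned above.

```python
def audit(comment, basic_info, matchers, predict_proba, threshold=0.6):
    """Return (publishable, reasons) for one audit request.

    matchers      -- callables (comment, basic_info) -> list of unmatched content
    predict_proba -- callable comment -> probability of being publishable
    """
    # Stage 1: preset pattern matching (operation 204).
    reasons = []
    for matcher in matchers:
        reasons.extend(matcher(comment, basic_info))
    if reasons:
        # Operation 206: fail the audit and report the unmatched content.
        return False, reasons

    # Stage 2: model prediction and threshold comparison (operations 208, 210).
    probability = predict_proba(comment)
    return probability > threshold, []
```

Returning the collected reasons alongside the decision mirrors the notification message of operation 206, which carries the unmatched information as the reason for failure.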
In the process of implementing the present disclosure, through research, the inventor of the present disclosure finds that comment information published on a project needs to meet certain requirements, for example, in a comment on a house source, a broker comment on the house source cannot present information inconsistent with basic information of a house, biased language, and the like. In this embodiment, at least one preset pattern (pattern) may be set in advance for the requirement, and then, based on a pattern matching method, whether the comment information matches with the at least one preset pattern is checked based on basic information of the target item, so that comment information that does not meet the requirement may be identified in a pattern matching manner, and a reason that the comment information does not pass the check is given in a manner of outputting information that does not match with the at least one preset pattern in the comment information. Due to the complexity of Chinese language, some semantic problems cannot be solved well by a mode matching-based method, and if the conditions set by a preset mode are strict, information which does not meet requirements in the comment information can not be screened out comprehensively, so that the comment information cannot be checked effectively; if the conditions set by the preset mode are wide, some information meeting the requirements can be mistakenly screened from the comment information, and therefore the accuracy is low. 
To address the semantic problems that a pattern-matching-based method cannot solve, a machine learning model is used to predict whether comment information that has passed the pattern matching belongs to publishable information. If the comment information belongs to publishable information, it passes the audit; otherwise, it does not pass the audit. In this way, the comment information can be effectively audited at the semantic level by using the strong semantic learning capability of the machine learning model. According to the embodiment of the disclosure, pattern matching is combined with the machine learning model, so that the comment information can be audited comprehensively, effectively, and accurately, and the content that fails the pattern matching can be given as the reason for failing the audit, so that the user or service person who published the comment information can learn the specific reason and adjust the content of the comment information in time.
Optionally, before the embodiment shown in fig. 1 or fig. 2, the method may further include: in response to the detection of a piece of newly added comment information, taking the item targeted by the newly added comment information as a target item, taking the newly added comment information as comment information of the target item, acquiring basic information of the target item from an item database, and generating and sending the audit request message. Therefore, real-time review of the comment information can be realized when the comment information is published by the user or the service personnel.
Alternatively, before the embodiment shown in fig. 1 or fig. 2, the method may further include: according to a preset auditing period, respectively taking a project as a target project, sequentially selecting a piece of newly-added comment information of the target project in the auditing period as comment information of the target project, acquiring basic information of the target project from a project database, and generating and sending auditing request information. Therefore, after the user or the service personnel publish the comment information, the periodical review of the comment information can be realized.
Optionally, in some possible implementations, the at least one preset pattern may include, for example and without limitation, any one or more of the following: a pattern of consistency between the comment information and the basic information, a pattern of no restricted keyword appearing, a pattern of conforming to a preset logical combination, a pattern of no repeated description information, and the like. Accordingly, in operation 104 or 204, performing preset pattern matching on the comment information based on the basic information may include any one or more of the following:
checking whether information inconsistent with the basic information exists in the comment information; if the comment information does not contain information inconsistent with the basic information, the comment information passes the consistency pattern;
checking whether the comment information includes a keyword in a first preset keyword table, where the first preset keyword table includes at least one keyword whose use is prohibited; if the comment information does not include any keyword in the first preset keyword table, the comment information passes the pattern of no restricted keyword appearing. The prohibited keywords in the first preset keyword table may be keywords that are not allowed to appear and are preset according to actual needs, for example, words with special meanings and specific names (for example, company names of competitors), guiding comments (for example, "contact me in advance to view the house", "the key is in my hands"), professional omission language (for example, "whether there are no private camps"), information unrelated to the project (for example, "a public fund is substituted"), and uncertain terms. The first preset keyword table may be updated in real time according to actual needs. By matching the comment information against the first preset keyword table, prohibited keywords in the comment information can be accurately identified;
identifying whether the comment information contains description information that does not conform to a preset logical combination pattern, where the preset logical combination pattern includes two or more sub-patterns having a logical relationship; if the comment information contains no such description information, the comment information passes the pattern of conforming to a preset logical combination;
identifying whether repeated description information for different dimensions exists in the comment information; if no repeated description information for different dimensions exists in the comment information, the comment information passes the pattern of no repeated description information.
In the above implementation, the comment information is considered to pass the at least one preset pattern matching only when it matches all of the at least one preset pattern. Otherwise, if the comment information fails to match any one of the at least one preset pattern, the comment information is considered not to pass the at least one preset pattern matching.
Based on the embodiment, whether content inconsistent with basic information appears in the comment information, whether a limiting keyword appears, whether the comment information accords with a preset logic combination mode, whether repeated description information exists in the comment information, and the like can be audited, and if any one of the conditions appears, the comment information cannot pass the audit.
In an optional example, when the comment information is matched with the basic information to determine whether information inconsistent with the basic information exists in the comment information, attribute information of each dimension of a target item may be extracted from the basic information, and description information of at least one dimension of the target item may be extracted from the comment information, where the basic information of the target item includes the attribute information of each dimension of the target item, and the comment information of the target item includes the description information of at least one dimension of the target item; then, whether information inconsistent with attribute information of the corresponding dimension in the basic information exists in the description information of the at least one dimension is compared. Therefore, the description information of each dimension in the comment information can be compared with the attribute information of the corresponding dimension in the basic information, and whether the content inconsistent with the basic information appears in the comment information or not can be confirmed.
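The per-dimension comparison described above can be sketched in Python as follows. The sketch assumes that the description information and the attribute information have already been extracted into dictionaries keyed by dimension, and that values are comparable by simple equality; the function name `find_inconsistencies` and both parameter names are hypothetical, and the extraction step itself is not shown.

```python
def find_inconsistencies(description, attributes):
    """Return {dimension: (described, actual)} for dimensions that disagree.

    description -- per-dimension values extracted from the comment information
    attributes  -- per-dimension values extracted from the basic information
    """
    mismatches = {}
    for dimension, described in description.items():
        actual = attributes.get(dimension)
        # Only dimensions present in the basic information can be compared.
        if actual is not None and described != actual:
            mismatches[dimension] = (described, actual)
    return mismatches
```

An empty result would correspond to the comment information passing the consistency pattern; a non-empty result identifies the content to report as the reason for failure.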
In an optional example, when identifying whether there is description information that does not conform to the preset logical combination mode in the comment information, the description information of at least one dimension for the target item may be extracted from the comment information, and then, whether there is description information that does not conform to the preset logical combination mode in the description information of at least one dimension may be identified. The preset logical combination patterns may include a plurality of preset logical combination patterns, and each of the plurality of logical combination patterns includes one or more patterns corresponding to a logical relationship (e.g., a causal relationship, a progressive relationship, a parallel relationship, etc.).
Some viewpoints need to satisfy multiple conditions or can be expressed through different conditions. For example, a house near a subway may be identified through a "near the subway" pattern, through a pattern on the distance from the subway, or through both patterns at the same time; the two patterns may therefore be logically combined to form a group of logical combination patterns, which may be expressed, for example, as "close to the subway", "less than 1000 meters from a subway station", "less than 1000 meters from a subway station and close to the subway", and the like. Therefore, by extracting the description information of at least one dimension of the target item from the comment information, it can be confirmed whether the comment information contains description information that does not conform to the preset logical combination pattern.
In an optional example, when identifying whether there is repeated description information for different dimensions in the comment information, the description information for each dimension of the target item may be extracted from the comment information, hash values of the description information for each dimension may be calculated, and whether there is a case where the hash values of the description information for different dimensions are the same or not may be compared. In this optional example, the presence of repeated description information for different dimensions in the comment information includes: there are cases where hash values of description information for different dimensions are the same.
The hash value of the description information of each dimension may be calculated by using hash functions such as Message Digest 5 (MD5), Secure Hash Algorithm 1 (SHA1), Cyclic Redundancy Check (CRC), and the like.
Based on the embodiment, whether the comment information has the completely same description information for different dimensions can be identified by comparing whether the hash values of the description information for different dimensions are the same, so that the situation that the description information for different dimensions in the comment information is completely repeated is avoided.
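As a minimal sketch of this hash comparison, the following Python function reports pairs of dimensions whose description information is identical. It assumes the per-dimension description information has already been extracted into a dictionary and uses MD5 from the standard `hashlib` module as one of the hash functions named above; the function name `find_exact_duplicates` is hypothetical.

```python
import hashlib


def find_exact_duplicates(descriptions):
    """Return pairs of dimensions whose description text hashes identically.

    descriptions -- mapping dimension -> description text for that dimension
    """
    first_seen = {}   # hash value -> first dimension that produced it
    duplicates = []
    for dimension, text in descriptions.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append((first_seen[digest], dimension))
        else:
            first_seen[digest] = dimension
    return duplicates
```

Comparing fixed-length digests rather than the texts themselves keeps the check cheap when descriptions are long, at the cost of detecting only exact repetition.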
Or, in another optional example, when whether repeated description information for different dimensions exists in the comment information is checked, the description information for each dimension of the target item may be extracted from the comment information, the description information for each dimension is segmented according to a preset mode to obtain a plurality of segments for each dimension, then hash values of the segments for each dimension are calculated respectively, Jaccard coefficients between the hash values of the segments for any two different dimensions are calculated, and then whether Jaccard coefficients for different dimensions are greater than a preset similarity threshold is compared, and if Jaccard coefficients for different dimensions are greater than the preset similarity threshold, description information for two dimensions with Jaccard coefficients greater than the preset similarity threshold is considered to be repeated; otherwise, if the condition that the Jaccard coefficients of different dimensions are larger than the preset similarity threshold value does not exist, the repeated description information aiming at the different dimensions does not exist in the comment information. In this optional example, the presence of repeated description information for different dimensions in the comment information includes: there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
The hash values may be calculated through hash functions such as MD5, SHA1, CRC, and the like, and the embodiment of the present disclosure does not limit the specifically adopted hash function.
The Jaccard coefficient is also called Jaccard similarity coefficient (Jaccard similarity coefficient) and is used for comparing similarity and difference between limited sample sets. The larger the Jaccard coefficient value, the higher the sample similarity. Given the two sets A, B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of the union of A and B.
In this embodiment, the description information of each dimension is segmented, the hash values of the segments of each dimension are calculated respectively, the hash values of the segments of each dimension are used as a set, and the Jaccard coefficients between the hash values of the segments of any two different dimensions are obtained by calculating the ratio of the size of the intersection of the sets of any two different dimensions to the size of the union, so that whether the description information of different dimensions is repeated is identified based on whether the Jaccard coefficients of different dimensions are greater than a preset similarity threshold.
Based on the embodiment, by comparing whether the Jaccard coefficients of different dimensions are greater than the preset similarity threshold, the case in which the description information for different dimensions in the comment information is partially the same can be identified, so that partial repetition of the description information for different dimensions in the comment information is avoided.
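The segment-level comparison can be sketched in Python as follows, with the Jaccard coefficient computed as |A ∩ B| / |A ∪ B| over the sets of segment hash values. Several details are assumptions not fixed by the disclosure: segments are obtained by splitting on common punctuation, MD5 is used as the hash function, and the default threshold of 0.5 is arbitrary. The names `segment_hashes`, `jaccard`, and `find_partial_duplicates` are hypothetical.

```python
import hashlib
import re
from itertools import combinations


def segment_hashes(text):
    """Split text on punctuation (an assumed preset mode) and hash each segment."""
    segments = [s.strip() for s in re.split(r"[,.;!?，。；！？]", text) if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in segments}


def jaccard(a, b):
    """Jaccard coefficient of two sets: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_partial_duplicates(descriptions, threshold=0.5):
    """Return dimension pairs whose segment-hash sets exceed the threshold."""
    hashed = {dim: segment_hashes(text) for dim, text in descriptions.items()}
    return [
        (d1, d2)
        for d1, d2 in combinations(hashed, 2)
        if jaccard(hashed[d1], hashed[d2]) > threshold
    ]
```

In practice the threshold would be tuned to the auditing standard: a lower value flags more pairs as repeated, and a higher value tolerates more shared phrasing between dimensions.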
Or, in yet another optional example, when whether repeated description information for different dimensions exists in the comment information is checked, the description information for each dimension of the target item may be extracted from the comment information, hash values of the description information for each dimension are calculated respectively, and whether the hash values of the description information for different dimensions are the same or not is compared; if the situation that the hash values of the description information of different dimensions are the same does not exist, segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension, then respectively calculating the hash values of the segments of each dimension, calculating the Jaccard coefficients between the hash values of the segments of any two different dimensions, and further comparing whether the situation that the Jaccard coefficients of different dimensions are larger than a preset similarity threshold exists or not. In this optional example, the presence of repeated description information for different dimensions in the comment information includes: there are cases where the hash values of the description information for different dimensions are the same, or there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
Based on the embodiment, whether the situation that the description information of different dimensions is completely the same exists in the comment information or not can be identified by comparing whether the hash values of the description information of different dimensions are the same or not, and if the situation that the description information of different dimensions is not completely the same exists, the situation that the description information of different dimensions is partially the same in the comment information can be identified by comparing whether the Jaccard coefficients of different dimensions are larger than the preset similarity threshold or not, so that the situation that the description information of different dimensions is completely the same and partially the same in the comment information can be comprehensively identified.
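The two-stage check described above can be sketched as follows. This is only a minimal illustration: the hash function (MD5), the segmentation rule (overlapping three-character segments), the threshold of 0.6, and all function names are assumptions introduced here and are not specified by the embodiment.

```python
import hashlib
from itertools import combinations


def text_hash(text):
    """Stable hash of a piece of text (MD5 is an assumption; any hash would do)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def segment_hashes(text, size=3):
    """Hash overlapping character segments; the segmentation rule is assumed."""
    if len(text) <= size:
        return {text_hash(text)}
    return {text_hash(text[i:i + size]) for i in range(len(text) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def has_repeated_descriptions(descriptions, threshold=0.6):
    """descriptions maps a dimension name to its description text."""
    # Stage 1: identical whole-description hashes mean fully repeated text.
    whole = [text_hash(t) for t in descriptions.values()]
    if len(set(whole)) < len(whole):
        return True
    # Stage 2: otherwise compare the segment-hash sets of every pair of dimensions.
    segs = {dim: segment_hashes(t) for dim, t in descriptions.items()}
    return any(jaccard(segs[a], segs[b]) > threshold
               for a, b in combinations(segs, 2))
```

Running the inexpensive whole-text comparison first means the pairwise Jaccard computation is only needed when no two dimensions are exactly identical.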
Fig. 3 is a flowchart of another embodiment of the information processing method of the present disclosure. As shown in fig. 3, on the basis of any of the above embodiments, 106 or 208 may include:
302, extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature.
The second preset keyword table includes at least one keyword whose use is restricted. The keywords in the second preset keyword table may be keywords that are preset according to actual requirements and that are not allowed to be used, or are not allowed to be used in some cases (e.g., company names of competitors, advertising terms, and uncivil terms). They may be the same as, partially the same as, or different from the keywords in the first preset keyword table in the above embodiments, which is not limited in this embodiment.
304, extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature.
The third preset keyword table includes at least one error information keyword. The error information keywords in the third preset keyword table are keywords extracted from error corpora (such as telephone numbers, external links, wrongly written characters, and incoherent sentences) that appear in comment information and are summarized from actual service.
306, obtaining, through a machine learning model, a probability value that the comment information belongs to publishable information based on the first feature and the second feature.
In this embodiment, the probability value that comment information belongs to issuable information can be accurately determined from the semantic aspect according to the feature information of each keyword in the second preset keyword table and the feature information of each keyword in the third preset keyword table extracted from the comment information by using the strong semantic learning and prediction capabilities of the machine learning model.
Optionally, in some possible implementation manners, in 302, a keyword in a second preset keyword table included in the comment information may be obtained to obtain a first target keyword, and then, for each first target keyword, a product of a Term Frequency (TF) of the first target keyword in the comment information and an Inverse Document Frequency (IDF) of the first target keyword is obtained to obtain a Term Frequency-Inverse Document Frequency (TF-IDF) value of the first target keyword, where the first feature includes the TF-IDF value of the first target keyword.
The term frequency is the number of times a given term appears in a document. The inverse document frequency measures the general importance of a term, that is, its importance weight, and is inversely related to how commonly the term is used; it is calculated by dividing the total number of documents in the corpus by the number of documents in the corpus that contain the term and then taking the logarithm of the quotient. Multiplying the term frequency of the first target keyword in the comment information by the inverse document frequency of the first target keyword gives the TF-IDF value, which represents the importance of the first target keyword to the document in which it appears: the larger the TF-IDF value, the more important the first target keyword is to the comment information document. If the comment information does not include any keyword in the second preset keyword table, the term frequency of the first target keyword is 0 and its TF-IDF value is correspondingly 0. In a specific implementation, when the TF-IDF value of the first target keyword is greater than a preset threshold, the comment information may be considered not to pass the audit. If the comment information includes a plurality of first target keywords, the sum or the average of their TF-IDF values may be used as the final TF-IDF value of the first target keywords included in the comment information.
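The TF-IDF calculation described above can be sketched as follows, assuming the comment information and the corpus documents have already been split into word lists. The function names, the tokenized representation, and the handling of a keyword that appears in no corpus document are assumptions made for illustration.

```python
import math


def tf_idf(keyword, comment_tokens, corpus):
    """TF-IDF of one keyword for a tokenized comment against a tokenized corpus."""
    tf = comment_tokens.count(keyword)  # raw occurrence count, as in the text
    if tf == 0:
        return 0.0
    df = sum(1 for doc in corpus if keyword in doc)
    if df == 0:
        return 0.0  # assumption: a keyword unseen in the corpus contributes nothing
    # IDF: logarithm of (total documents / documents containing the keyword)
    return tf * math.log(len(corpus) / df)


def keyword_table_feature(table, comment_tokens, corpus, reduce="sum"):
    """Sum or average the TF-IDF values of the table keywords found in the comment."""
    hits = [kw for kw in table if kw in comment_tokens]
    if not hits:
        return 0.0
    values = [tf_idf(kw, comment_tokens, corpus) for kw in hits]
    return sum(values) if reduce == "sum" else sum(values) / len(values)
```

The same helper could be applied to the third preset keyword table to obtain the second feature.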
Optionally, in some possible implementation manners, in 304, a keyword in the third preset keyword table that is included in the comment information may be obtained as a second target keyword; then, for each second target keyword, the product of the TF of the second target keyword in the comment information and the IDF of the second target keyword is obtained as the TF-IDF value of the second target keyword, where the second feature includes the TF-IDF value of the second target keyword.
Similarly, the number of times that each second target keyword appears in the comment information is obtained as the word frequency of the second target keyword, the total number of documents in the corpus is divided by the number of documents in the corpus containing the second target keyword, and the obtained quotient is logarithmized to obtain the inverse document frequency of the second target keyword. Multiplying the word frequency of the second target keyword in the comment information by the inverse document frequency of the second target keyword to obtain a TF-IDF value, wherein the TF-IDF value represents the importance of the second target keyword to the document where the second target keyword is located, and the larger the TF-IDF value is, the higher the importance of the second target keyword to the comment information document is. And if the comment information does not include the keyword in the third preset keyword table, the word frequency of the second target keyword is 0, and the TF-IDF value of the second target keyword is 0 correspondingly. In a specific implementation, when the TF-IDF value of the second target keyword is greater than a preset threshold, the comment information may be considered to be not approved. If a plurality of second target keywords are included in the comment information, the sum or average of the TF-IDF values of the plurality of second target keywords may be used as the final TF-IDF value of the second target keyword included in the comment information.
Therefore, the probability value of whether the review information passes the review can be predicted based on the TF-IDF values of the keywords in the second preset keyword table and the third preset keyword table, for example, when the final TF-IDF value of the first target keyword included in the review information and/or the final TF-IDF value of the second target keyword included in the review information is greater than a preset threshold, the review information is considered not to belong to the publishable information, and therefore the review is not passed, and therefore accurate and objective evaluation of whether the review information passes the review can be achieved.
In addition, referring back to fig. 3, in a further embodiment of the information processing method of the present disclosure, 106 or 208 may further include:
308, extracting the feature of the description information of each dimension of the target item from the comment information respectively to obtain at least one dimension feature.
And 310, extracting the characteristics of the basic attribute information of the target item from the basic information to obtain item attribute characteristics.
Some basic attributes of the target item can be determined based on the item attribute features, and whether error information which does not accord with the basic attributes of the item exists in the comment information can be determined through whether the features of the description information of each dimension extracted from the comment information are matched with the item attribute features. If the characteristics of the description information of each dimension extracted from the comment information are not matched with the attribute characteristics of the item, error information which is not matched with the basic attribute of the item appears in the comment information, and the comment information does not belong to publishable information and cannot pass the audit.
Specifically, by predicting the matching degree between the feature of the description information of each dimension extracted from the comment information and the item attribute feature, when the matching degree is smaller than a preset matching degree threshold, the comment information does not belong to the issuable information and therefore cannot pass the audit. Accordingly, in operation 306, a probability value that the comment information belongs to the issuable information is predicted based on the first feature, the second feature, the at least one dimension feature, and the item attribute feature via a machine learning model.
In this embodiment, the strong semantic learning and prediction capabilities of the machine learning model can be utilized, and according to the feature information of each keyword in the second preset keyword table, the feature information of each keyword in the third preset keyword table, the project attribute feature and the feature of the description information of each dimension in the comment information extracted from the comment information, the comment information can be comprehensively reviewed from the aspect of semantics.
Optionally, in some possible implementation manners, in operation 308, features of words in the description information may be extracted from the comment information respectively for each dimension of the target item, and then, for each dimension of the target item, based on a similarity between the features of the words in the description information and features of preset high-frequency words in the corresponding dimension, a preset high-frequency word that matches the features of the words in the description information (that is, the similarity is highest, or the similarity is higher than a preset similarity threshold) is obtained as the feature of the corresponding dimension. Wherein the at least one dimensional feature comprises: and the matched preset high-frequency words correspond to all dimensions of the target item.
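The matching in operation 308 can be illustrated with the following sketch. It assumes that the feature of each word is a numeric vector and that similarity is measured by cosine similarity; neither choice is stated in the embodiment, and the function names and the threshold of 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_high_frequency_word(word_vectors, preset_vectors, threshold=0.8):
    """Return the preset high-frequency word most similar to any word of the
    description, provided its similarity exceeds the threshold; otherwise None."""
    best_word, best_score = None, threshold
    for vec in word_vectors:
        for preset, preset_vec in preset_vectors.items():
            score = cosine(vec, preset_vec)
            if score > best_score:
                best_word, best_score = preset, score
    return best_word
```

Calling the function once per dimension, with that dimension's preset high-frequency words, would yield the at least one dimension feature.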
Optionally, in some possible implementation manners, in operation 310, the basic attribute information may include, for example, city information and item type information, based on which some relevant policies for the target item may be determined. For example, when the item is a housing property, the item type information may be house type information such as a residence, a commercial building, or commercial housing, and by combining the city information with the house type information, the relevant transaction policies of the current housing listing (i.e., the target item) can be determined, such as whether purchase restrictions apply, purchase qualifications, tax standards, and transaction time limits.
In some alternative examples, the feature of the city information and the feature of the item type information may each be represented by a one-hot code, which consists of 0s and 1s. For example, for the city information, the bit corresponding to the city to which the target item belongs is 1 and all other bits are 0; the feature of the item type information is represented in the same way.
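A one-hot representation of this kind can be sketched as follows. The city and item type vocabularies are hypothetical, and concatenating the two codes into a single item attribute feature is an assumption.

```python
CITIES = ["Beijing", "Shanghai", "Chengdu"]              # hypothetical vocabulary
ITEM_TYPES = ["residence", "commercial", "apartment"]    # hypothetical vocabulary


def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`.
    A value outside the vocabulary yields all zeros."""
    return [1 if item == value else 0 for item in vocabulary]


def item_attribute_feature(city, item_type):
    """Concatenate the city and item-type one-hot codes into one feature vector."""
    return one_hot(city, CITIES) + one_hot(item_type, ITEM_TYPES)
```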
Based on the embodiment, the preset high-frequency words matched with the features of the words in the description information are obtained as the features of the corresponding dimensions, and whether the error information inconsistent with the basic attributes of the items appears in the comment information can be quickly and accurately determined by comparing whether the preset high-frequency words are matched with the features of the basic attribute information. And when the characteristics of the preset high-frequency words and the basic attribute information are not matched, determining that the comment information does not belong to the issuable information, and thus failing to pass the audit.
Therefore, whether the comment information belongs to the issuable information or not can be determined through whether the TF-IDF value of the first target keyword is larger than the preset threshold value or not, whether the TF-IDF value of the second target keyword is larger than the preset threshold value or not, and whether the characteristics of the preset high-frequency word and the basic attribute information are matched or not.
Optionally, in some possible implementations, a first probability value that the TF-IDF value of the first target keyword is not greater than a preset threshold, a second probability value that the TF-IDF value of the second target keyword is not greater than a preset threshold, and a third probability value that the preset high-frequency word matches with the feature of the basic attribute information may be predicted, and a total probability value obtained by summing the first probability value, the second probability value, and the third probability value is taken as a probability value that the comment information belongs to the issuable information.
Or, in other possible implementations, a first probability value that the TF-IDF value of the first target keyword is greater than a preset threshold, a second probability value that the TF-IDF value of the second target keyword is greater than a preset threshold, and a third probability value that the preset high-frequency word is not matched with the feature of the basic attribute information may be predicted, a total probability value obtained by a sum of the first probability value, the second probability value, and the third probability value is used as a probability value that the comment information does not belong to the issuable information, and a probability value that the comment information belongs to the issuable information may be obtained by a probability value that the comment information does not belong to the issuable information, for example, a difference between 1 and a probability value that the normalized comment information does not belong to the issuable information is used as a probability value that the comment information belongs to the issuable information.
Optionally, in some possible implementations, some words carry no discriminative meaning, for example, the most frequent words in a document such as "yes" and "in". To prevent such words from affecting the prediction result of the machine learning model, and to improve prediction efficiency, they may be added to a stop word table. The words in the stop word table are filtered out of the comment information and the basic attribute information of the target item in advance, and the first feature, the second feature, the at least one dimension feature, and the item attribute feature are then extracted from the filtered comment information and basic attribute information and input into the machine learning model for prediction.
Optionally, in some possible implementations, the machine learning model in the above embodiments may be implemented by any classification model capable of text classification, for example, a gradient boosting decision tree (GDBT) model, a support vector machine (SVM), a fast text classifier (FastText), a k-nearest neighbor model, a multi-layer perceptron, naive Bayes (including Bernoulli, Gaussian, and multinomial naive Bayes), a random forest, a feedforward neural network, a long short-term memory (LSTM) model, or another classification model.
The GDBT (Gradient Boosting Decision Tree, more commonly abbreviated GBDT) model adopts an iterative decision tree algorithm and consists of a plurality of decision trees. Each decision tree is formed in a feature classification process, and the final prediction result is determined by the classification results of all the decision trees.
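Stop word filtering of this kind can be sketched as follows, assuming the text has already been split into words. The contents of the stop word table and the case-insensitive comparison are assumptions.

```python
STOP_WORDS = {"is", "in", "the", "a", "of"}  # hypothetical stop word table


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word table (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]
```

The filtered word list would then be passed to the feature extraction steps in place of the original one.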
Optionally, in some possible implementation manners, three weak classifiers may be used to classify the first feature, the second feature, the at least one dimension feature, and the item attribute feature, and a first weak classifier is used to classify whether the first feature is smaller than a preset threshold value, so as to obtain a first classification result; classifying whether the second characteristic is smaller than a preset threshold value by a second weak classifier to obtain a second classification result; classifying the matching degree between the at least one dimension characteristic and the item attribute characteristic by a third weak classifier to obtain a third classification result; and then obtaining the probability value of the comment information belonging to the issuable information based on the classification results of the three weak classifiers.
In a specific implementation, a classification and regression tree (CART) is typically chosen as the weak classifier.
In the embodiment of the disclosure, each keyword in the second preset keyword table, each keyword in the third preset keyword table, and the description information of each dimension and the basic information of the item in a plurality of matched pieces of comment information may be used in advance as training samples to construct a training sample set. Features of each sample in the training sample set are extracted to iteratively train the GDBT model until a preset training completion condition is satisfied. The resulting GDBT model can predict the probability value that an input object (i.e., a piece of information, such as the comment information) belongs to issuable information based on the features extracted from the input object (the first feature, the second feature, the features of the description information of each dimension in the comment information, and the item attribute features), so that whether the input object belongs to issuable information is determined based on the probability value.
The GDBT model is an algorithm for classifying or regressing data by using an additive model (i.e. a linear combination of basis functions) and continuously reducing residual errors generated in a training process.
For example, in one possible implementation, the GDBT model may be trained as follows:
the first step is as follows: when the GDBT model is trained, a classification regression tree is trained for each possible classification of each sample x in a training sample set. If the current training sample set has three types, that is, K is 3, and the sample x only belongs to the second type, the classification result for the sample x may be represented by a three-dimensional vector [0,1,0], where 0 represents that the sample x does not belong to the class, and 1 represents that the sample x belongs to the class, and since the sample x only belongs to the second type, the vector dimension corresponding to the second type is 1, and the vector dimensions corresponding to the other classes are 0. If the sample x belongs to the first class and the second class only at the same time, the classification result for the sample x can be represented by a three-dimensional vector [1,1,0 ]. If the sample x belongs to the second class and the third class only, the classification result for the sample x can be represented by a three-dimensional vector [0,1,1], and so on.
Aiming at the condition that the sample has three types (namely, the samples belong to each keyword in a second preset keyword list, each keyword in a third preset keyword list, and description information of each dimension and basic information of the item in a plurality of matched comment information), three classification regression trees are trained simultaneously during each round of training, namely, the parameters of the three classification regression trees are adjusted. Taking the example that the sample x only belongs to the second class, the first classification regression tree is (x,0) for the first class of the sample x, the second classification regression tree is (x,1) for the second class of the sample x, and the third classification regression tree is (x,0) for the third class of the sample x.
After training on the sample x, three classification regression trees are generated, whose predicted values for the sample x are $f_{11}(x)$, $f_{22}(x)$, and $f_{33}(x)$. The residuals for the first class, the second class, and the third class are respectively:
$y_{11} = 0 - f_{11}(x)$
$y_{22} = 1 - f_{22}(x)$
$y_{33} = 0 - f_{33}(x)$
Then a second round of training is started, in which the input for the first class is $(x, y_{11}(x))$, the input for the second class is $(x, y_{22}(x))$, and the input for the third class is $(x, y_{33}(x))$, and three classification regression trees continue to be trained.
Training is iterated in this way for M rounds, with three classification regression trees constructed in each round.
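The round-by-round residual fitting described above can be illustrated with the following toy sketch. It makes several simplifying assumptions that are not part of the embodiment: each sample is a single number rather than a feature vector, each tree is a depth-1 regression stump, and each round's prediction is subtracted in full (no learning rate). All function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on scalar inputs.
    Returns (threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right) if right else lm
        err = sum((t - (lm if x <= thr else rm)) ** 2 for x, t in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    thr, lm, rm = stump
    return lm if x <= thr else rm


def train_boosted_trees(xs, labels, rounds):
    """Train `rounds` stumps per class; labels are 0/1 vectors such as [0, 1, 0]."""
    n_classes = len(labels[0])
    models = [[] for _ in range(n_classes)]
    for k in range(n_classes):
        residuals = [float(y[k]) for y in labels]  # round 1 targets are the labels
        for _ in range(rounds):
            stump = fit_stump(xs, residuals)
            models[k].append(stump)
            # New residual = previous target - prediction, e.g. y = 1 - f(x).
            residuals = [r - predict_stump(stump, x) for x, r in zip(xs, residuals)]
    return models


def predict(models, x):
    """Sum each class's stumps to obtain the per-class predicted values."""
    return [sum(predict_stump(s, x) for s in stumps) for stumps in models]
```

With three classes and M rounds, `train_boosted_trees` builds three stumps per round, mirroring the three classification regression trees constructed in each round.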
The process of processing the input object by the classification regression tree to output the predicted value is consistent with the process of processing the first feature, the second feature, the feature of the description information of each dimension and the project attribute feature in the above embodiment to output the processing result, and details are not repeated here.
When training is completed and the features of an input object $x_1$ arrive, predicting the category of the input object $x_1$ produces three predicted values $f_1(x)$, $f_2(x)$, and $f_3(x)$. The input object $x_1$ cannot pass the audit when it satisfies one or more of the following conditions: it belongs to the first class, it belongs to the second class, or it does not belong to the third class. Therefore, the three predicted values $f_1(x)$, $f_2(x)$, and $f_3(x)$ are respectively normalized to mapped values $f_{10}(x)$, $f_{20}(x)$, and $f_{30}(x)$ within the interval $[0, 1]$, and the probability value $P$ that the input object $x_1$ passes the audit can be expressed as: $P = (1 - f_{10}(x)) + (1 - f_{20}(x)) + f_{30}(x)$. Based on the predicted probability value $P_c(x)$, it may be determined whether the input object belongs to issuable information.
Any of the information processing methods provided by the embodiments of the present disclosure may be performed by any suitable device having data processing capabilities, including but not limited to a terminal device, a server, and the like. Alternatively, any of the information processing methods provided by the embodiments of the present disclosure may be executed by a processor; for example, the processor may execute any of the information processing methods mentioned in the embodiments of the present disclosure by calling corresponding instructions stored in a memory. Details are not repeated below.
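The combination of the three normalized predicted values can be sketched as follows. The normalization method is not specified in the text; min-max scaling with clamping is assumed here, and the function names are hypothetical.

```python
def min_max_normalize(value, low, high):
    """Map a value into [0, 1]; min-max scaling with clamping is an assumption."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (value - low) / (high - low)))


def pass_score(f1, f2, f3, low=0.0, high=1.0):
    """P = (1 - f10) + (1 - f20) + f30, computed from the normalized predictions."""
    f10, f20, f30 = (min_max_normalize(f, low, high) for f in (f1, f2, f3))
    return (1 - f10) + (1 - f20) + f30
```

As written, the sum ranges over [0, 3]. A caller that needs a value in [0, 1] could divide the result by 3, which is an assumption not stated in the text.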
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 4 is a schematic structural diagram of an embodiment of an information processing apparatus according to the present disclosure. The information processing apparatus of this embodiment can be used to implement the above-described information processing method embodiments of the present disclosure. As shown in fig. 4, the information processing apparatus of this embodiment includes: a receiving module, a pattern matching module, a machine learning model, and a determining module. Wherein:
The receiving module is used for receiving an audit request message, where the audit request message includes the comment information and the basic information of the target item.
The pattern matching module is used for performing at least one preset pattern matching on the comment information based on the basic information.
The machine learning model is used for, in response to the comment information passing the at least one preset pattern matching, obtaining a probability value that the comment information belongs to publishable information based on at least one feature extracted from the comment information.
The determining module is used for determining whether the comment information belongs to publishable information based on the probability value.
In some possible implementations, the machine learning model may be implemented using any classification model capable of text classification, such as a GDBT model, an SVM, FastText, a k-nearest neighbor model, a multi-layer perceptron, naive Bayes, a random forest, a feedforward neural network, an LSTM model, or another classification model.
Based on the information processing device provided by the above embodiment of the present disclosure, after receiving the audit request message, first, at least one preset pattern matching is performed on the comment information in the audit request message based on the basic information in the audit request message, if the comment information passes through the at least one preset pattern matching, a probability value that the comment information belongs to issuable information is obtained further through a machine learning model based on at least one feature extracted from the comment information, and then, whether the comment information belongs to the issuable information is determined based on the probability value. Therefore, the embodiment of the disclosure realizes the comprehensive examination and accurate judgment of the comment information of the target item in the aspects of pattern matching and semantics based on the basic information of the target item and by combining a pattern matching mode and a machine learning model, can effectively guarantee the objectivity, accuracy and high quality of the comment of the target item, and avoids the situations of inconsistency with the actual situation of the item, wrong information or invalid information.
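The cooperation of the four modules can be sketched as follows. This is only an illustration: the class and function names are hypothetical, the pattern checks and the model are passed in as plain callables, and comparing the probability value with a threshold of 0.5 is an assumed decision rule.

```python
from dataclasses import dataclass


@dataclass
class AuditRequest:
    """What the receiving module receives: the audit request message."""
    comment: str       # comment information of the target item
    basic_info: dict   # basic information of the target item


def audit(request, patterns, model, threshold=0.5):
    """Return True if the comment information is judged publishable."""
    # Pattern matching module: every preset pattern must be matched.
    if not all(check(request.comment, request.basic_info) for check in patterns):
        return False
    # Machine learning model: probability that the comment is publishable.
    probability = model(request.comment, request.basic_info)
    # Determining module: the threshold comparison is an assumption.
    return probability >= threshold
```

Because the model is only invoked after all pattern checks succeed, comments that fail an inexpensive pattern check never reach the model.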
Fig. 5 is a schematic structural diagram of another embodiment of the information processing apparatus according to the present disclosure. As shown in fig. 5, on the basis of the embodiment shown in fig. 4, the information processing apparatus of this embodiment further includes: an output module, used for, in response to the comment information failing to match any one or more of the at least one preset pattern, outputting an audit result notification message indicating that the comment information fails the audit, where the notification message includes the information in the comment information that does not match the at least one preset pattern.
In addition, referring back to fig. 5, the information processing apparatus according to the embodiment of the present disclosure may further include: an acquisition module to: in response to the detection of a piece of newly added comment information, taking the item targeted by the newly added comment information as a target item, taking the newly added comment information as comment information of the target item, acquiring basic information of the target item from an item database, and generating and sending the audit request message; or, for: according to a preset auditing period, respectively taking a project as a target project, sequentially selecting a piece of newly-added comment information of the target project in the auditing period as comment information of the target project, acquiring basic information of the target project from a project database, and generating and sending auditing request information.
Optionally, in some possible implementations, the at least one preset pattern may include, for example and without limitation, any one or more of the following: a pattern of consistency between the comment information and the basic information, a pattern in which no restricted keyword appears, a pattern of conforming to a preset logical combination, a pattern in which no repeated description information exists, and the like. Accordingly, in this implementation, the pattern matching module may include, but is not limited to, any one or more of the following: a first auditing unit, a second auditing unit, a third auditing unit, and a fourth auditing unit.
The first auditing unit is used for matching the comment information with the basic information to determine whether information inconsistent with the basic information exists in the comment information; if no information inconsistent with the basic information exists in the comment information, the comment information passes the matching of the pattern of consistency between the comment information and the basic information.
The second auditing unit is used for identifying whether the comment information includes a keyword in a first preset keyword table, where the first preset keyword table includes at least one keyword that is forbidden to be used; if the comment information does not include any keyword in the first preset keyword table, the comment information passes the matching of the pattern in which no restricted keyword appears.
The third auditing unit is used for identifying whether the comment information contains description information that does not conform to a preset logical combination pattern, where the preset logical combination pattern includes two or more sub-patterns having a logical relationship; if the comment information contains no description information that fails to conform to the preset logical combination pattern, the comment information passes the matching of the corresponding preset logical combination pattern.
The fourth auditing unit is used for identifying whether repeated description information for different dimensions exists in the comment information; if no repeated description information for different dimensions exists in the comment information, the comment information passes the matching of the pattern in which no repeated description information exists.
The comment information passing the at least one preset pattern matching includes: the comment information passing the matching of all preset patterns in the at least one preset pattern.
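The rule that all preset patterns must be matched, together with reporting which patterns failed, can be sketched as follows. Only two of the four auditing units are illustrated, and the check logic, the keyword table, the city list, and all names are hypothetical simplifications.

```python
KNOWN_CITIES = {"Beijing", "Shanghai", "Chengdu"}   # hypothetical city list
BANNED_KEYWORDS = {"guaranteed profit"}             # hypothetical first preset keyword table


def consistent_with_basic_info(comment, basic_info):
    """Simplified first auditing unit: fail if the comment names another city."""
    other_cities = KNOWN_CITIES - {basic_info.get("city")}
    return not any(city in comment for city in other_cities)


def no_restricted_keyword(comment, basic_info):
    """Simplified second auditing unit: fail if a forbidden keyword appears."""
    return not any(word in comment for word in BANNED_KEYWORDS)


AUDIT_UNITS = {
    "consistency": consistent_with_basic_info,
    "no_restricted_keyword": no_restricted_keyword,
}


def run_audit_units(comment, basic_info, units=AUDIT_UNITS):
    """Run every unit; return (passed, names of the patterns that did not match)."""
    failed = [name for name, check in units.items() if not check(comment, basic_info)]
    return (not failed, failed)
```

The list of failed pattern names is the kind of information an output module could include in a notification that the audit was not passed.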
Optionally, in some optional examples, the first auditing unit is specifically configured to: extracting attribute information of each dimension of the target item from the basic information, and extracting description information of at least one dimension of the target item from the comment information; the basic information of the target item comprises attribute information of each dimension of the target item, and the comment information of the target item comprises description information of at least one dimension of the target item; and comparing whether the description information of at least one dimension has information inconsistent with the attribute information of the corresponding dimension in the basic information.
Optionally, in some optional examples, the third auditing unit is specifically configured to: extracting description information of at least one dimension for the target item from the comment information; and identifying whether the description information of the at least one dimension has description information which does not accord with a preset logic combination mode.
Optionally, in some optional examples, the fourth auditing unit is specifically configured to: extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; wherein the presence of repeated description information for different dimensions in the comment information includes: there are cases where hash values of description information for different dimensions are the same.
Or, in another optional example, the fourth auditing unit is specifically configured to: extracting description information of each dimension of the target item from the comment information; segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; respectively calculating hash values of a plurality of segments of each dimension, and calculating Jaccard coefficients between the hash values of a plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold value or not; wherein the presence of repeated description information for different dimensions in the comment information includes: there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
Or, in some further optional examples, the fourth auditing unit is specifically configured to: extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; if the situation that the hash values of the description information of different dimensions are the same does not exist, segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; respectively calculating hash values of a plurality of segments of each dimension, and calculating Jaccard coefficients between the hash values of a plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold value or not; the existence of repeated description information aiming at different dimensions in the comment information comprises the following steps: there are cases where the hash values of the description information for different dimensions are the same, or there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
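The two-stage variant (exact hash comparison first, then Jaccard similarity over segment hashes) might be sketched as below. The choice of SHA-256, character trigrams as the preset segmentation, and a threshold of 0.8 are assumptions made for illustration; the disclosure does not fix any of them.

```python
import hashlib
from itertools import combinations
from typing import Dict, Set


def _hash(text: str) -> str:
    """Hash a piece of text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _segment_hashes(text: str, size: int = 3) -> Set[str]:
    """Hash overlapping character n-grams, one possible preset segmentation."""
    if len(text) <= size:
        return {_hash(text)}
    return {_hash(text[i:i + size]) for i in range(len(text) - size + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def has_repeated_description(descriptions: Dict[str, str],
                             threshold: float = 0.8) -> bool:
    """Detect exact or near-duplicate descriptions across different dimensions."""
    pairs = list(combinations(descriptions.values(), 2))
    # Stage 1: identical descriptions share the same whole-text hash.
    if any(_hash(x) == _hash(y) for x, y in pairs):
        return True
    # Stage 2: near duplicates have a high Jaccard coefficient over segment hashes.
    return any(jaccard(_segment_hashes(x), _segment_hashes(y)) > threshold
               for x, y in pairs)
```

Running the cheap exact-hash comparison first avoids the segmentation cost whenever two dimensions contain literally the same text.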
In addition, referring back to fig. 5, the information processing apparatus according to the embodiment of the present disclosure may further include a first feature extraction module and a second feature extraction module. The first feature extraction module is configured to extract feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature, where the second preset keyword table includes at least one restricted-use keyword. The second feature extraction module is configured to extract feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature, where the third preset keyword table includes at least one error information keyword. Accordingly, in this embodiment, the machine learning model is specifically configured to obtain, based on the first feature and the second feature, a probability value that the comment information belongs to publishable information.
Optionally, in some possible implementation manners, the first feature extraction module is specifically configured to: acquiring keywords in a second preset keyword table included in the comment information to obtain first target keywords; respectively aiming at each first target keyword, obtaining the product of TF of the first target keyword in the comment information and IDF of the first target keyword to obtain TF-IDF value of the first target keyword; the first feature includes a TF-IDF value of the first target keyword.
Optionally, in some possible implementation manners, the second feature extraction module is specifically configured to: acquire the keywords in the third preset keyword table that are included in the comment information to obtain second target keywords; and, for each second target keyword, obtain the product of the TF of the second target keyword in the comment information and the IDF of the second target keyword to obtain the TF-IDF value of the second target keyword. The second feature includes the TF-IDF value of the second target keyword.
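The TF-IDF computation used for both keyword tables can be sketched as follows. The sketch assumes that the comment has already been tokenized, that IDF is estimated from a tokenized reference corpus with add-one smoothing, and that the feature is laid out as a fixed-length vector with one entry per table keyword; none of these details is specified in the disclosure.

```python
import math
from typing import Dict, List


def idf_table(corpus: List[List[str]], keywords: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency of each keyword over a corpus."""
    n = len(corpus)
    return {
        kw: math.log((1 + n) / (1 + sum(kw in doc for doc in corpus))) + 1.0
        for kw in keywords
    }


def tfidf_features(tokens: List[str], keywords: List[str],
                   idf: Dict[str, float]) -> List[float]:
    """One TF-IDF value per table keyword; 0.0 when the keyword is absent."""
    total = len(tokens) or 1  # avoid division by zero for an empty comment
    return [tokens.count(kw) / total * idf[kw] for kw in keywords]
```

The same two functions can be applied once with the second preset keyword table and once with the third preset keyword table to obtain the first feature and the second feature respectively.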
In addition, referring back to fig. 5, on the basis of the above embodiment, the information processing apparatus of the present embodiment may further include a third feature extraction module and a fourth feature extraction module. The third feature extraction module is configured to extract, from the comment information, features of the description information of each dimension of the target item, to obtain at least one dimension feature. The fourth feature extraction module is configured to extract features of the basic attribute information of the target item from the basic information, to obtain item attribute features. Accordingly, in this embodiment, the machine learning model is specifically configured to obtain, based on the first feature, the second feature, the at least one dimension feature, and the item attribute features, a probability value that the comment information belongs to publishable information.
Optionally, in some possible implementation manners, the basic attribute information may include, for example: city information and item type information.
Optionally, in some possible implementation manners, the third feature extraction module is specifically configured to: for each dimension of the target item, extract from the comment information the features of each word in the description information of that dimension; and, for each dimension of the target item, based on the similarity between the features of each word in the description information and the features of each preset high-frequency word of the corresponding dimension, acquire the preset high-frequency words that match the features of the words in the description information as the feature of the corresponding dimension. The at least one dimension feature includes the matched preset high-frequency words corresponding to the dimensions of the target item.
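For a single dimension, the matching of description words to preset high-frequency words might look like the sketch below. It assumes that word features are dense vectors (for example, word embeddings), that similarity is cosine similarity, and that a match requires a minimum similarity of 0.8; all of these, together with the function names, are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_high_frequency_words(word_vectors: Dict[str, Vector],
                               preset_vectors: Dict[str, Vector],
                               min_similarity: float = 0.8) -> List[str]:
    """Map each description word to its most similar preset high-frequency word."""
    matched: List[str] = []
    for vec in word_vectors.values():
        best = max(preset_vectors,
                   key=lambda w: cosine(vec, preset_vectors[w]),
                   default=None)
        if (best is not None
                and cosine(vec, preset_vectors[best]) >= min_similarity
                and best not in matched):
            matched.append(best)
    return matched
```

Calling this function once per dimension, with that dimension's preset high-frequency words, yields the dimension features.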
In addition, an embodiment of the present disclosure also provides an electronic device, including:
a memory for storing a computer program;
a processor, configured to execute the computer program stored in the memory, and when the computer program is executed, implement the information processing method according to any of the above embodiments of the present disclosure.
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device according to the present disclosure. Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 6. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
As shown in fig. 6, the electronic device includes one or more processors and memory.
The processor may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The memory may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by a processor to implement the information processing methods of the various embodiments of the present disclosure described above and/or other desired functions.
In one example, the electronic device may further include: an input device and an output device, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device may also include, for example, a keyboard, a mouse, and the like.
The output device may output various information including the determined distance information, direction information, and the like to the outside. The output devices may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 6, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
In addition to the above-described methods and apparatuses, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the information processing method according to various embodiments of the present disclosure described in the above-mentioned part of the specification.
The computer program product may include program code for carrying out operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform steps in an information processing method according to various embodiments of the present disclosure described in the above section of the present specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (22)

1. An information processing method characterized by comprising:
receiving an audit request message, wherein the audit request message comprises comment information and basic information of a target item, and the basic information comprises attribute information of each dimension of the target item;
performing at least one preset pattern matching on the comment information based on the basic information;
in response to the comment information passing the at least one preset pattern matching, obtaining a probability value of the comment information belonging to publishable information through a machine learning model based on at least one feature extracted from the comment information;
determining whether the comment information belongs to publishable information based on the probability value;
wherein the obtaining, by the machine learning model, a probability value of whether the comment information belongs to publishable information based on at least one feature extracted from the comment information includes:
extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature; the second preset keyword table comprises at least one keyword for limiting use;
extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature; the third preset keyword table comprises at least one error information keyword;
extracting features of the description information of each dimension of the target item from the comment information respectively to obtain at least one dimension feature;
extracting features of the basic attribute information of the target item from the basic information to obtain item attribute features;
obtaining, by the machine learning model, a probability value that the comment information belongs to publishable information based on the first feature, the second feature, the at least one dimension feature, and the item attribute features.
2. The method of claim 1, further comprising:
in response to the comment information failing to match any one or more of the at least one preset pattern, outputting an audit result notification message indicating that the comment information has failed the audit, wherein the audit result notification message includes the information in the comment information that does not match the at least one preset pattern.
3. The method of claim 1, wherein before receiving the audit request message, the method further comprises:
in response to the detection of a piece of newly added comment information, taking the item targeted by the newly added comment information as a target item, taking the newly added comment information as comment information of the target item, acquiring basic information of the target item from an item database, and generating and sending the audit request message; or,
according to a preset auditing period, taking each item in turn as a target item, sequentially selecting a piece of newly added comment information of the target item in the auditing period as comment information of the target item, acquiring basic information of the target item from an item database, and generating and sending the audit request message.
4. The method according to any one of claims 1-3, wherein the at least one preset pattern comprises any one or more of: a mode of consistency between the comment information and the basic information, a mode in which no restricted keyword occurs, a mode of conforming to a preset logical combination mode, and a mode in which no repeated description information exists;
the comment information is subjected to preset pattern matching based on the basic information, and the preset pattern matching comprises any one or more of the following items:
matching the comment information with the basic information to determine whether information inconsistent with the basic information exists in the comment information; if the comment information does not have information inconsistent with the basic information, the comment information is matched with the basic information in a consistency mode;
identifying whether the comment information comprises a keyword in a first preset keyword table; the first preset keyword table comprises at least one keyword forbidden to be used; if the comment information does not include any keyword in the first preset keyword table, the comment information passes matching under the mode in which no restricted keyword occurs;
identifying whether the comment information contains description information which does not accord with a preset logic combination mode; the preset logic combination mode comprises two or more sub-modes with logic relation; if the comment information does not have description information which does not accord with the preset logic combination mode, the comment information is matched through the mode which accords with the preset logic combination mode;
identifying whether repeated description information aiming at different dimensions exists in the comment information; if repeated description information aiming at different dimensions does not exist in the comment information, the comment information is matched through the mode of the non-existent repeated description information;
the comment information is matched through the at least one preset pattern, and the comment information comprises: the comment information is matched through all preset patterns in the at least one preset pattern.
5. The method of claim 4, wherein the matching the comment information with the base information to determine whether information inconsistent with the base information exists in the comment information comprises:
extracting attribute information of each dimension of the target item from the basic information, and extracting description information of at least one dimension of the target item from the comment information; wherein the review information for the target item includes descriptive information for at least one dimension of the target item;
and comparing whether the description information of at least one dimension has information inconsistent with the attribute information of the corresponding dimension in the basic information.
6. The method of claim 4, wherein the identifying whether the description information which does not conform to the preset logical combination mode exists in the comment information comprises:
extracting description information of at least one dimension for the target item from the comment information;
and identifying whether the description information of the at least one dimension has description information which does not accord with a preset logic combination mode.
7. The method of claim 4, wherein the identifying whether duplicate description information exists for different dimensions in the comment information comprises:
extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; wherein the presence of repeated description information for different dimensions in the comment information includes: the hash values of the description information of different dimensions are the same;
or,
extracting description information of each dimension of the target item from the comment information; segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; respectively calculating hash values of the plurality of segments of each dimension, and calculating Jaccard coefficients between the hash values of the plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold; wherein the presence of repeated description information for different dimensions in the comment information includes: there are cases where the Jaccard coefficients of different dimensions are larger than the preset similarity threshold;
or,
extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; if the situation that the hash values of the description information of different dimensions are the same does not exist, segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; respectively calculating hash values of a plurality of segments of each dimension, and calculating Jaccard coefficients between the hash values of a plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold value or not; the existence of repeated description information aiming at different dimensions in the comment information comprises the following steps: there are cases where the hash values of the description information for different dimensions are the same, or there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
8. The method according to any one of claims 1 to 3, wherein the extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature comprises:
acquiring keywords in a second preset keyword table included in the comment information to obtain first target keywords;
respectively aiming at each first target keyword, obtaining the product of the word frequency TF of the first target keyword in the comment information and the inverse document frequency IDF of the first target keyword to obtain the word frequency-inverse document frequency TF-IDF value of the first target keyword; the first feature comprises a TF-IDF value of the first target keyword;
and/or,
the extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature comprises:
acquiring keywords in a third preset keyword table included in the comment information to obtain second target keywords;
respectively aiming at each second target keyword, obtaining the product of TF of the second target keyword in the comment information and IDF of the second target keyword to obtain TF-IDF value of the second target keyword; the second feature includes a TF-IDF value of the second target keyword.
9. The method according to any one of claims 1 to 3, wherein the basic attribute information includes: city information and item type information; and/or,
the extracting features of the description information of each dimension for the target item from the comment information respectively to obtain at least one dimension feature includes:
respectively aiming at each dimension of the target item, extracting the characteristics of each word in the description information from the comment information;
respectively aiming at each dimension of the target item, acquiring preset high-frequency words matched with the features of each word in the description information as the features of the corresponding dimension based on the similarity between the features of each word in the description information and the features of each preset high-frequency word of the corresponding dimension; the at least one dimension feature comprises: the matched preset high-frequency words corresponding to the dimensions of the target item.
10. The method of any of claims 1-3, wherein the machine learning model comprises: a gradient boosting decision tree (GBDT) model.
11. An audit device of comment information is characterized by comprising:
a receiving module, configured to receive an audit request message, wherein the audit request message comprises comment information and basic information of a target item, and the basic information comprises attribute information of each dimension of the target item;
the mode matching module is used for carrying out at least one preset mode matching on the comment information based on the basic information;
a machine learning model, configured to, in response to the comment information passing the at least one preset pattern matching, obtain a probability value of the comment information belonging to publishable information based on at least one feature extracted from the comment information;
a determining module, configured to determine whether the comment information belongs to publishable information based on the probability value;
the first feature extraction module is used for extracting feature information of each keyword in a second preset keyword table from the comment information to obtain a first feature; the second preset keyword table comprises at least one keyword for limiting use;
the second feature extraction module is used for extracting feature information of each keyword in a third preset keyword table from the comment information to obtain a second feature; the third preset keyword table comprises at least one error information keyword;
the third feature extraction module is used for extracting features of the description information of each dimension of the target item from the comment information respectively to obtain at least one dimension feature;
the fourth feature extraction module is used for extracting features of the basic attribute information of the target item from the basic information to obtain item attribute features;
the machine learning model is specifically configured to obtain a probability value that the comment information belongs to publishable information based on the first feature, the second feature, the at least one dimension feature, and the item attribute features.
12. The apparatus of claim 11, further comprising:
an output module, configured to, in response to the comment information failing to match any one or more of the at least one preset pattern, output an audit result notification message indicating that the comment information has failed the audit, wherein the audit result notification message includes the information in the comment information that does not match the at least one preset pattern.
13. The apparatus of claim 11, further comprising an acquisition module configured to:
in response to the detection of a piece of newly added comment information, take the item targeted by the newly added comment information as a target item, take the newly added comment information as comment information of the target item, acquire basic information of the target item from an item database, and generate and send the audit request message; or,
according to a preset auditing period, take each item in turn as a target item, sequentially select a piece of newly added comment information of the target item in the auditing period as comment information of the target item, acquire basic information of the target item from an item database, and generate and send the audit request message.
14. The apparatus according to any of claims 11-13, wherein the at least one preset pattern comprises any one or more of: a pattern of consistency between the comment information and the basic information, a pattern in which no restricted keyword occurs, a pattern conforming to a preset logic combination mode, and a pattern in which no repeated description information exists;
the pattern matching module comprises any one or more of the following units: a first auditing unit, a second auditing unit, a third auditing unit and a fourth auditing unit; wherein:
the first auditing unit is used for matching the comment information with the basic information so as to determine whether information inconsistent with the basic information exists in the comment information; if no information inconsistent with the basic information exists in the comment information, the comment information matches the pattern of consistency between the comment information and the basic information;
the second auditing unit is used for identifying whether the comment information includes a keyword in a first preset keyword table, the first preset keyword table comprising at least one keyword whose use is forbidden; if the comment information does not include any keyword in the first preset keyword table, the comment information matches the pattern in which no restricted keyword occurs;
the third auditing unit is used for identifying whether the comment information contains description information which does not conform to a preset logic combination mode, the preset logic combination mode comprising two or more sub-modes having a logical relationship; if the comment information contains no description information which does not conform to the preset logic combination mode, the comment information matches the pattern conforming to the preset logic combination mode;
the fourth auditing unit is used for identifying whether repeated description information for different dimensions exists in the comment information; if no repeated description information for different dimensions exists in the comment information, the comment information matches the pattern in which no repeated description information exists;
wherein the comment information matching the at least one preset pattern comprises: the comment information matching all preset patterns in the at least one preset pattern.
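As an illustration only, the "match all preset patterns" rule of claim 14, together with the reporting of unmatched patterns described for the output module, might be sketched as follows. The comment shape (a mapping from dimension name to description text), the keyword table contents, and all function names are assumptions made for this sketch and are not taken from the patent; only two of the four auditing units are shown, in simplified form.

```python
from typing import Callable, Dict, List, Tuple

# Assumed shape: dimension name -> description text for that dimension.
Comment = Dict[str, str]
Check = Callable[[Comment], bool]

# Hypothetical stand-in for the "first preset keyword table".
FORBIDDEN_KEYWORDS = {"guaranteed", "cheapest"}


def no_restricted_keywords(comment: Comment) -> bool:
    """True when no forbidden keyword occurs anywhere in the comment."""
    text = " ".join(comment.values()).lower()
    return not any(keyword in text for keyword in FORBIDDEN_KEYWORDS)


def no_repeated_descriptions(comment: Comment) -> bool:
    """True when no two dimensions carry the same (normalized) text."""
    texts = [text.strip().lower() for text in comment.values()]
    return len(texts) == len(set(texts))


# Each entry pairs a pattern name with the check that decides it.
CHECKS: List[Tuple[str, Check]] = [
    ("no_restricted_keywords", no_restricted_keywords),
    ("no_repeated_descriptions", no_repeated_descriptions),
]


def audit(comment: Comment) -> List[str]:
    """Return the names of the patterns the comment fails to match.

    An empty list means every configured pattern matched.
    """
    return [name for name, check in CHECKS if not check(comment)]
```

Returning the list of failed pattern names, rather than a single boolean, leaves room for a notification message that states which patterns were not matched.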
15. The apparatus according to claim 14, wherein the first auditing unit is specifically configured to:
extracting attribute information of each dimension of the target item from the basic information, and extracting description information of at least one dimension of the target item from the comment information; wherein the comment information for the target item includes description information for at least one dimension of the target item;
and comparing whether the description information of at least one dimension has information inconsistent with the attribute information of the corresponding dimension in the basic information.
16. The apparatus according to claim 14, wherein the third auditing unit is specifically configured to:
extracting description information of at least one dimension for the target item from the comment information;
and identifying whether the description information of the at least one dimension has description information which does not accord with a preset logic combination mode.
17. The apparatus according to claim 14, wherein the fourth auditing unit is specifically configured to:
extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; wherein the presence of repeated description information for different dimensions in the comment information includes: the hash values of the description information of different dimensions are the same;
or,
extracting description information of each dimension of the target item from the comment information; segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; calculating the hash values of the plurality of segments of each dimension respectively, and calculating the Jaccard coefficients between the hash values of the plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold value or not; wherein the presence of repeated description information for different dimensions in the comment information includes: the condition that the Jaccard coefficients of different dimensions are larger than a preset similarity threshold exists;
or,
extracting description information of each dimension of the target item from the comment information, respectively calculating hash values of the description information of each dimension, and comparing whether the hash values of the description information of different dimensions are the same; if the situation that the hash values of the description information of different dimensions are the same does not exist, segmenting the description information of each dimension respectively according to a preset mode to obtain a plurality of segments of each dimension; respectively calculating hash values of the plurality of segments of each dimension, and calculating Jaccard coefficients between the hash values of the plurality of segments of any two different dimensions; comparing whether the Jaccard coefficients of different dimensions are larger than a preset similarity threshold value or not; wherein the presence of repeated description information for different dimensions in the comment information includes: there are cases where the hash values of the description information for different dimensions are the same, or there are cases where the Jaccard coefficients for different dimensions are greater than a preset similarity threshold.
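The two-stage check in the third alternative of claim 17 (whole-description hash comparison first, then Jaccard coefficients over segment hashes) can be sketched roughly as below. The claim leaves the hash function, the segmentation mode, and the threshold open; SHA-256, word bigrams, and a default threshold of 0.8 are assumptions chosen for this sketch, as are all function names.

```python
import hashlib
from itertools import combinations
from typing import Dict, Set


def _hash(text: str) -> str:
    """Hash a piece of text (SHA-256 is an assumption, not from the claim)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def segment_hashes(text: str, n: int = 2) -> Set[str]:
    """Split text into word n-grams (assumed segmentation) and hash each."""
    words = text.lower().split()
    if len(words) < n:
        return {_hash(" ".join(words))}
    return {_hash(" ".join(words[i:i + n])) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A & B| / |A | B| of two non-empty sets."""
    return len(a & b) / len(a | b)


def has_repeated_descriptions(descriptions: Dict[str, str],
                              threshold: float = 0.8) -> bool:
    """Return True if two different dimensions carry repeated text."""
    normalized = {dim: text.strip().lower() for dim, text in descriptions.items()}

    # Stage 1: identical whole-description hashes across dimensions.
    whole = [_hash(text) for text in normalized.values()]
    if len(whole) != len(set(whole)):
        return True

    # Stage 2: near-duplicates, via Jaccard over segment hashes.
    segments = {dim: segment_hashes(text) for dim, text in normalized.items()}
    return any(
        jaccard(segments[d1], segments[d2]) > threshold
        for d1, d2 in combinations(segments, 2)
    )
```

Running the cheap exact-hash comparison first means the pairwise Jaccard computation is only needed when no exact duplicate is found, which matches the ordering stated in the claim.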
18. The apparatus according to any one of claims 11 to 13, wherein the first feature extraction module is specifically configured to: acquiring keywords in a second preset keyword table included in the comment information to obtain first target keywords; respectively aiming at each first target keyword, obtaining the product of the word frequency TF of the first target keyword in the comment information and the inverse document frequency IDF of the first target keyword to obtain the word frequency-inverse document frequency TF-IDF value of the first target keyword; the first feature comprises a TF-IDF value of the first target keyword;
and/or,
the second feature extraction module is specifically configured to: acquiring keywords in a third preset keyword table included in the comment information to obtain second target keywords; respectively aiming at each second target keyword, obtaining the product of TF of the second target keyword in the comment information and IDF of the second target keyword to obtain TF-IDF value of the second target keyword; the second feature includes a TF-IDF value of the second target keyword.
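The TF-IDF feature described in claim 18 is the product of a keyword's term frequency in the comment and its inverse document frequency. A minimal sketch follows; the claim does not fix the exact TF or IDF formula, so the length-normalized TF and the smoothed IDF used here are assumptions, as are the function name and the pre-tokenized inputs.

```python
import math
from typing import Dict, List, Sequence


def tf_idf_features(comment_tokens: Sequence[str],
                    keyword_table: Sequence[str],
                    corpus: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Compute TF-IDF values for table keywords that occur in the comment.

    comment_tokens: the tokenized comment being audited.
    keyword_table:  a preset keyword table (contents are hypothetical).
    corpus:         tokenized reference comments used to estimate IDF.
    """
    n_docs = len(corpus)
    features: Dict[str, float] = {}
    for keyword in keyword_table:
        count = comment_tokens.count(keyword)
        if count == 0:
            continue  # only keywords present in the comment are target keywords
        tf = count / len(comment_tokens)
        doc_freq = sum(1 for doc in corpus if keyword in doc)
        # Smoothed IDF (an assumed variant) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        features[keyword] = tf * idf
    return features
```

The same function could serve both the first and the second feature extraction modules by passing the second or the third preset keyword table, respectively.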
19. The apparatus according to any one of claims 11-13, wherein the basic attribute information comprises: city information and project type information; and/or,
the third feature extraction module is specifically configured to:
respectively aiming at each dimension of the target item, extracting the characteristics of each word in the description information from the comment information;
respectively aiming at each dimension of the target item, acquiring preset high-frequency words matched with the features of each word in the description information as the features of the corresponding dimension based on the similarity between the features of each word in the description information and the features of each preset high-frequency word of the corresponding dimension; the at least one dimension feature comprises: the matched preset high-frequency words corresponding to each dimension of the target item.
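For a single dimension, the similarity-based matching of claim 19 could look roughly like the sketch below. The claim does not name a similarity measure or say how word features are obtained; cosine similarity over precomputed word vectors and a threshold of 0.8 are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_high_frequency_words(word_vectors: List[Vector],
                               preset_words: Dict[str, Vector],
                               threshold: float = 0.8) -> List[str]:
    """Map each word vector of one dimension's description to the most
    similar preset high-frequency word, keeping matches at or above the
    (assumed) similarity threshold."""
    matched: List[str] = []
    for vector in word_vectors:
        best_word, best_score = None, threshold
        for word, preset_vector in preset_words.items():
            score = cosine(vector, preset_vector)
            if score >= best_score:
                best_word, best_score = word, score
        if best_word is not None and best_word not in matched:
            matched.append(best_word)
    return matched
```

Calling this once per dimension, with that dimension's own preset high-frequency words, would yield one list of matched words per dimension to use as the dimension features.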
20. The apparatus of any of claims 11-13, wherein the machine learning model comprises: a gradient boosting decision tree (GBDT) model.
21. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method of any one of the preceding claims 1-10.
22. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of the preceding claims 1 to 10.
CN202010886868.1A 2020-08-28 2020-08-28 Information processing method and apparatus, electronic device, and storage medium Active CN112199578B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010886868.1A CN112199578B (en) 2020-08-28 2020-08-28 Information processing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010886868.1A CN112199578B (en) 2020-08-28 2020-08-28 Information processing method and apparatus, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN112199578A CN112199578A (en) 2021-01-08
CN112199578B true CN112199578B (en) 2022-04-22

Family

ID=74005629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010886868.1A Active CN112199578B (en) 2020-08-28 2020-08-28 Information processing method and apparatus, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN112199578B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271509A (en) * 2018-08-23 2019-01-25 武汉斗鱼网络科技有限公司 Generation method, device, computer equipment and the storage medium of direct broadcasting room topic
CN109766428A (en) * 2019-02-02 2019-05-17 中国银行股份有限公司 Data query method and apparatus, data processing method
CN110362592A (en) * 2019-06-17 2019-10-22 平安科技(深圳)有限公司 Ruling director information method for pushing, device, computer equipment and storage medium
US10540446B2 (en) * 2018-01-31 2020-01-21 Jungle Disk, L.L.C. Natural language generation using pinned text and multiple discriminators

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1964037A4 (en) * 2005-12-16 2012-04-25 Nextbio System and method for scientific information knowledge management

Also Published As

Publication number Publication date
CN112199578A (en) 2021-01-08

Similar Documents

Publication Publication Date Title
Linton et al. Dynamic topic modelling for cryptocurrency community forums
US9996504B2 (en) System and method for classifying text sentiment classes based on past examples
Soni et al. Sentiment analysis of customer reviews based on hidden markov model
CN109325121B (en) Method and device for determining keywords of text
CN111651552B (en) Structured information determining method and device and electronic equipment
CN115809887B (en) Method and device for determining main business scope of enterprise based on invoice data
US11030228B2 (en) Contextual interestingness ranking of documents for due diligence in the banking industry with topicality grouping
EP4252139A1 (en) Systems and methods for relevance-based document analysis and filtering
US9558462B2 (en) Identifying and amalgamating conditional actions in business processes
CN114722141A (en) Text detection method and device
US20220319143A1 (en) Implicit Coordinates and Local Neighborhood
US11593385B2 (en) Contextual interestingness ranking of documents for due diligence in the banking industry with entity grouping
US11604923B2 (en) High volume message classification and distribution
Wang et al. Fake review identification methods based on multidimensional feature engineering
US11893008B1 (en) System and method for automated data harmonization
CN115329207B (en) Intelligent sales information recommendation method and system
CN112199578B (en) Information processing method and apparatus, electronic device, and storage medium
Sam Abraham et al. Readers’ affect: Predicting and understanding readers’ emotions with deep learning
Hussein et al. Machine learning approach to sentiment analysis in data mining
Tornés et al. Detecting forged receipts with domain-specific ontology-based entities & relations
CN113342969A (en) Data processing method and device
Kang et al. A transfer learning algorithm for automatic requirement model generation
CN111324707A (en) User interaction method and device, computer-readable storage medium and electronic equipment
CN113609407B (en) Regional consistency verification method and device
Kiomourtzis et al. A multi-lingually applicable journalist toolset for the big-data era

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210318

Address after: 100085 Floor 101 102-1, No. 35 Building, No. 2 Hospital, Xierqi West Road, Haidian District, Beijing

Applicant after: Seashell Housing (Beijing) Technology Co.,Ltd.

Address before: Unit 05, room 112, 1st floor, office building, Nangang Industrial Zone, economic and Technological Development Zone, Binhai New Area, Tianjin 300457

Applicant before: BEIKE TECHNOLOGY Co.,Ltd.

GR01 Patent grant