WO2020181800A1

WO2020181800A1 - Apparatus and method for predicting score for question and answer content, and storage medium

Info

Publication number: WO2020181800A1
Application number: PCT/CN2019/116548
Authority: WO
Inventors: 程磊
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-03-12
Filing date: 2019-11-08
Publication date: 2020-09-17
Also published as: CN110069772B; CN110069772A

Abstract

An apparatus and a method for predicting a score for question and answer content, and a storage medium, the method comprising: collecting historical question and answer content of a written exam stage and a corresponding actual score for each question and answer content (S1); on the basis of the answer content, constructing a term segmentation library, a text corpus, a term frequency-inverse document frequency indicator model, and a latent Dirichlet allocation model, and storing same in a database (S2); importing the term segmentation library, the text corpus, the term frequency-inverse document frequency indicator model, and the latent Dirichlet allocation model in the database, performing term segmentation and term frequency calculation, and then inputting a result into the term frequency-inverse document frequency indicator model and the latent Dirichlet allocation model, and acquiring an outputted maximum likelihood array of historical question and answer content of a same topic as question and answer content to be scored; and on the basis of actual scores corresponding to the maximum likelihood array, calculating a predicted score for the question and answer content to be scored. The method is able to ensure objectivity and fairness in scoring.

Description

Apparatus, method, and storage medium for predicting the score of question-and-answer content

This application claims priority to Chinese patent application No. 201910185054.2, filed with the Chinese Patent Office on March 12, 2019 and entitled "Apparatus, method and storage medium for predicting the score of question and answer content", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of data analysis technology, and in particular to an apparatus, a method, and a storage medium for predicting the score of question-and-answer (Q&A) content.

Background

At present, corporate recruitment generally involves a written test, and the written test usually includes question-and-answer items. For positions such as management and product roles in particular, question-and-answer items make up a large part of the written test. These items are generally scored manually. Manual scoring is strongly influenced by the scorer's subjective thinking and preferences, which undermines the objectivity of the score, and it is also time-consuming and labor-intensive.

Summary of the invention

The purpose of this application is to provide an apparatus, a method, and a storage medium for predicting the score of Q&A content, so that Q&A content in a written test can be scored objectively and fairly.

To achieve the above object, the present application provides an apparatus for predicting the score of Q&A content. The apparatus includes a memory and a processor connected to the memory. The memory stores a processing system that can run on the processor, and when the processing system is executed by the processor, the following steps are implemented:

collecting historical Q&A content from the written test stage and the actual score corresponding to each Q&A content;

constructing a word segmentation library, a corpus, a term frequency-inverse document frequency (TF-IDF) model, and a latent Dirichlet allocation (LDA) model based on the Q&A content, and saving them in a database;

importing the word segmentation library, the corpus, the TF-IDF model, and the LDA model from the database; segmenting the Q&A content to be scored based on the word segmentation library; counting the term frequencies of the resulting tokens based on the corpus; then inputting the term-frequency statistics into the TF-IDF model and the LDA model in turn, and obtaining the output queue of historical Q&A content that has the highest probability of belonging to the same topic as the Q&A content to be scored;

selecting, from the highest-probability queue, the Q&A content whose similarity to the Q&A content to be scored is greater than or equal to a predetermined threshold, as a similar queue;

if the length of the similar queue is greater than or equal to 2, obtaining the actual score corresponding to each Q&A content in the similar queue, and calculating the predicted score of the Q&A content to be scored based on those actual scores.
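The selection and length check in the last two steps can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names, the `(identifier, similarity)` pair layout, and the threshold value of 0.80 are all assumptions introduced here, and the similarity values are taken as already computed.

```python
from typing import List, Tuple

# Hypothetical layout: each candidate is (answer_id, similarity to the answer being scored).
Candidate = Tuple[str, float]


def select_similar_queue(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Keep only candidates whose similarity is greater than or equal to the threshold."""
    return [(answer_id, sim) for answer_id, sim in candidates if sim >= threshold]


def can_predict(similar_queue: List[Candidate]) -> bool:
    """A predicted score is only computed when the similar queue holds at least 2 items."""
    return len(similar_queue) >= 2


# Illustrative values only.
candidates = [("a1", 0.91), ("a2", 0.42), ("a3", 0.80), ("a4", 0.79)]
similar = select_similar_queue(candidates, threshold=0.80)
print(similar, can_predict(similar))
```

Note that the comparison is inclusive, so a candidate whose similarity equals the threshold is kept.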

Preferably, the step of calculating the predicted score of the Q&A content to be scored based on the actual score corresponding to each Q&A content in the similar queue specifically includes:

where P_j is the predicted score of the Q&A content j to be scored; the mean term is the average of the actual scores of all Q&A content in the similar queue; L is the length of the similar queue; Sim(i,j) is the similarity between Q&A content i in the similar queue and the Q&A content j to be scored; and r_i is the actual score corresponding to Q&A content i in the similar queue.
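The formula itself does not appear in this text, so the sketch below is only one form consistent with the quantities defined here: a mean-centered, similarity-weighted average over the L items of the similar queue. The exact expression, the function name, and the `(similarity, actual_score)` pair layout are assumptions, not the application's stated formula.

```python
from typing import List, Tuple


def predict_score(similar_queue: List[Tuple[float, float]]) -> float:
    """Assumed form: mean + sum(Sim * (r_i - mean)) / sum(|Sim|).

    similar_queue holds (similarity, actual_score) pairs for the L similar answers.
    """
    length = len(similar_queue)
    if length < 2:
        raise ValueError("the similar queue must contain at least 2 items")
    mean_score = sum(score for _, score in similar_queue) / length
    weight_total = sum(abs(sim) for sim, _ in similar_queue)
    if weight_total == 0:
        # No usable similarity weights: fall back to the plain mean.
        return mean_score
    weighted_deviation = sum(sim * (score - mean_score) for sim, score in similar_queue)
    return mean_score + weighted_deviation / weight_total


# Illustrative values only.
print(predict_score([(0.9, 80.0), (0.6, 90.0)]))
```

Under this assumed form, answers that are more similar to the one being scored pull the prediction toward their own actual scores.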

Preferably, when the processing system is executed by the processor, the following steps are further implemented:

obtaining the actual scores of N Q&A contents to be scored, and calculating, based on those actual scores, the mean absolute error of the predicted scores of the N Q&A contents to be scored, including:

where r_j is the actual score corresponding to the Q&A content j to be scored, and N is an integer greater than or equal to 2;

analyzing the accuracy of the predicted scores based on the mean absolute error.
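The formula is not reproduced in this text. The sketch below uses the standard definition of mean absolute error, the average of |P_j - r_j| over the N answers, which matches the symbols defined here. The function name and the sample values are illustrative assumptions.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Standard MAE: (1 / N) * sum(|P_j - r_j|), with N >= 2 as required in the text."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("N must be an integer greater than or equal to 2")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / n


# Illustrative values only: absolute errors are 4, 5 and 1.
print(mean_absolute_error([84.0, 70.0, 91.0], [80.0, 75.0, 90.0]))
```

A smaller value means the predicted scores are closer to the actual scores.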

Preferably, the step of constructing the word segmentation library, the corpus, the term frequency-inverse document frequency (TF-IDF) model, and the latent Dirichlet allocation (LDA) model based on the Q&A content and saving them in the database specifically includes:

segmenting each Q&A content with a predetermined word segmentation algorithm to obtain a segmentation result for each Q&A content; constructing the corresponding word segmentation library based on the segmentation results; generating the corresponding corpus based on the word segmentation library; constructing the TF-IDF model based on the corpus; constructing the LDA model based on the TF-IDF model; and saving the word segmentation library, the corpus, the TF-IDF model, and the LDA model after each has been iteratively trained a predetermined number of times.

To achieve the above object, the present application also provides a method for predicting the score of Q&A content, the method including:

S1: collecting historical Q&A content from the written test stage and the actual score corresponding to each Q&A content;

S2: constructing a word segmentation library, a corpus, a term frequency-inverse document frequency (TF-IDF) model, and a latent Dirichlet allocation (LDA) model based on the Q&A content, and saving them in a database;

S3: importing the word segmentation library, the corpus, the TF-IDF model, and the LDA model from the database; segmenting the Q&A content to be scored based on the word segmentation library; counting the term frequencies of the resulting tokens based on the corpus; then inputting the term-frequency statistics into the TF-IDF model and the LDA model in turn, and obtaining the output queue of historical Q&A content that has the highest probability of belonging to the same topic as the Q&A content to be scored;

S4: selecting, from the highest-probability queue, the Q&A content whose similarity to the Q&A content to be scored is greater than or equal to a predetermined threshold, as a similar queue;

S5: if the length of the similar queue is greater than or equal to 2, obtaining the actual score corresponding to each Q&A content in the similar queue, and calculating the predicted score of the Q&A content to be scored based on those actual scores.

where P_j is the predicted score of the Q&A content j to be scored.

Preferably, after step S5, the method further includes:

Preferably, the predetermined word segmentation algorithm is a hidden Markov algorithm.

The present application also provides a computer-readable storage medium on which a processing system is stored. When the processing system is executed by a processor, the steps of the above method for predicting the score of Q&A content are implemented.

The beneficial effects of this application are as follows. The application first constructs a word segmentation library, a corpus, a term frequency-inverse document frequency (TF-IDF) model, and a latent Dirichlet allocation (LDA) model from a large volume of existing written-test Q&A content. At application time, the Q&A content to be scored is segmented based on the word segmentation library, and term frequencies are counted based on the corpus. The term-frequency statistics are then input into the TF-IDF model and the LDA model in turn, yielding the queue of historical Q&A content that has the highest probability of belonging to the same topic as the Q&A content to be scored. The actual scores of the more similar items in that queue are obtained, and the predicted score of the Q&A content to be scored is calculated from those actual scores. Because the models are obtained by repeated training on a large volume of Q&A content, the influence of a scorer's subjective thinking and preferences on the objectivity of the score is eliminated, the objectivity and fairness of scoring are ensured, and time and effort are saved.

Description of the drawings

FIG. 1 is a schematic diagram of an optional application environment for the embodiments of this application;

FIG. 2 is a schematic diagram of the hardware architecture of an embodiment of the apparatus for predicting the score of Q&A content in FIG. 1;

FIG. 3 is a program module diagram of an embodiment of the processing system in FIG. 1 and FIG. 2;

FIG. 4 is a schematic flowchart of an embodiment of the method for predicting the score of Q&A content in this application.

Detailed description

To make the purpose, technical solutions, and advantages of this application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application, without creative work, fall within the protection scope of this application.

It should be noted that descriptions involving "first", "second", and the like in this application are for descriptive purposes only and shall not be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be implemented by a person of ordinary skill in the art. Where a combination of technical solutions is contradictory or cannot be implemented, the combination shall be deemed not to exist and falls outside the protection scope claimed by this application.

FIG. 1 is a schematic diagram of the application environment of a preferred embodiment of this application. In this application environment, the apparatus 1 for predicting the score of Q&A content is connected to an input device 2 and an output device 3 through a network 4. The Q&A content to be scored is entered through the input device 2; the apparatus 1 predicts a score for that content and transmits the predicted score to the output device 3 through the network 4. The apparatus 1 includes a processing system 10 (APP), which analyzes the Q&A content to be scored to obtain the predicted score; the predicted score is output through the output device 3.

The apparatus 1 for predicting the score of Q&A content is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. The apparatus 1 may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing in which a group of loosely coupled computers forms one virtual supercomputer.

In this embodiment, as shown in FIG. 2, the apparatus 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicatively connected to one another through a system bus. The memory 11 stores a processing system that can run on the processor 12. It should be noted that FIG. 2 only shows the apparatus 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the apparatus 1 for predicting the score of Q&A content. The readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the apparatus 1, such as its hard disk. In other embodiments, the non-volatile storage medium may be an external storage device of the apparatus 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the apparatus 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed on the apparatus 1, for example the program code of the processing system in an embodiment of this application. In addition, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 12 is generally used to control the overall operation of the apparatus 1 for predicting the score of Q&A content, for example to perform control and processing related to data interaction or communication with other devices. In this embodiment, the processor 12 is used to run the program code stored in the memory 11 or to process data, for example to run the processing system.

The network interface 13 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the apparatus 1 and other devices. In this embodiment, the network interface 13 is mainly used to connect the apparatus 1 to the input device 2 and the output device 3, establishing a data transmission channel and a communication connection.

The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored in the memory 11. The at least one computer-readable instruction can be executed by the processor 12 to implement the methods of the embodiments of this application, and it can be divided into different logic modules according to the functions implemented by its parts.

In one embodiment, the processing system implements the following steps when executed by the processor 12:

FIG. 3 is a program module diagram of the processing system 10 in FIG. 1 and FIG. 2. The processing system 10 is divided into multiple modules, which are stored in the memory 11 and executed by the processor 12 to carry out this application. A module, as referred to in this application, is a series of computer program instruction segments capable of performing a specific function.

a collection module 101, configured to collect historical Q&A content from the written test stage and the actual score corresponding to each Q&A content;

a construction module 102, configured to construct a word segmentation library, a corpus, a term frequency-inverse document frequency (TF-IDF) model, and a latent Dirichlet allocation (LDA) model based on the Q&A content and save them in a database;

an output module 103, configured to import the word segmentation library, the corpus, the TF-IDF model, and the LDA model from the database; segment the Q&A content to be scored based on the word segmentation library; count the term frequencies of the resulting tokens based on the corpus; then input the term-frequency statistics into the TF-IDF model and the LDA model in turn, and obtain the output queue of historical Q&A content that has the highest probability of belonging to the same topic as the Q&A content to be scored;

a selection module 104, configured to select, from the highest-probability queue, the Q&A content whose similarity to the Q&A content to be scored is greater than or equal to a predetermined threshold, as a similar queue;

a predictive scoring module 105, configured to, if the length of the similar queue is greater than or equal to 2, obtain the actual score corresponding to each Q&A content in the similar queue and calculate the predicted score of the Q&A content to be scored based on those actual scores.

FIG. 4 is a schematic flowchart of an embodiment of the method for predicting the score of Q&A content in this application. The method includes the following steps:

Step S1: collecting historical Q&A content from the written test stage and the actual score corresponding to each Q&A content.

A large volume of historical Q&A content from the written tests of various companies is collected. The Q&A content includes the question posed by the recruiting company and the applicant's answer. For example, the question is "How do you handle a disagreement with a colleague?" and the answer is "When I disagree with a colleague, I communicate with them based on the actual situation." The company gives a corresponding actual score for the answer.

Further, to reduce interfering information, the Q&A content may be cleaned, for example by correcting or removing spelling errors and garbled characters. Repeated answers do not need to be removed, however, because the subsequent steps rely on term-frequency statistics.

Step S2: constructing a word segmentation library, a corpus, a term frequency-inverse document frequency (TF-IDF) model, and a latent Dirichlet allocation (LDA) model based on the Q&A content, and saving them in a database.

First, each Q&A content is segmented with a predetermined word segmentation algorithm to obtain its segmentation result. The corresponding word segmentation library is constructed from the segmentation results, the corresponding corpus is generated from the word segmentation library, the TF-IDF model is constructed from the corpus, and the LDA model is constructed from the TF-IDF model. To obtain better libraries and models, the word segmentation library, the corpus, the TF-IDF model, and the LDA model are each iteratively trained a predetermined number of times (for example, 5000 times) and then saved.

The predetermined word segmentation algorithm may be a hidden Markov algorithm, or another algorithm such as the forward maximum matching algorithm (which takes m characters of the Q&A content from left to right as the matching field, where m is the length of the longest entry in the machine dictionary, and splits off a word when the match succeeds), the reverse maximum matching algorithm (the same idea as forward maximum matching, applied in the opposite direction), or the bidirectional maximum matching method (which compares the segmentation result of forward maximum matching with that of reverse maximum matching to decide on the correct segmentation). In a preferred embodiment, a long-word-first rule may also be used for segmentation. First, the Q&A content is split into short clauses at punctuation marks of preset types (for example, "，" and "、"): the text from the start of the Q&A content to the first such punctuation mark is one clause, the text between the first and second such punctuation marks is another clause, and so on. Each clause is then segmented using the long-word-first principle. Under this principle, for a clause T1 to be segmented, segmentation starts at the first character A: the longest word X1 beginning with A is looked up in a pre-built lexicon, X1 is removed from T1 to leave T2, and the same procedure is applied to T2. The result of the segmentation is "X1/X2/...".
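The clause splitting and long-word-first rule described above can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the punctuation set, the single-character fallback for text missing from the lexicon, and the unspaced Latin-letter sample text (used in place of Chinese text) are all introduced here and are not part of the application.

```python
import re
from typing import List, Set


def split_clauses(text: str, punctuation: str = ",;.，、；。") -> List[str]:
    """Split text into short clauses at any of the given punctuation marks."""
    parts = re.split("[" + re.escape(punctuation) + "]", text)
    return [part for part in parts if part]


def longest_word_first(clause: str, lexicon: Set[str]) -> List[str]:
    """Repeatedly strip the longest lexicon word that starts at the current position.

    Assumption: a character that starts no lexicon word is emitted on its own.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens: List[str] = []
    position = 0
    while position < len(clause):
        longest = min(max_len, len(clause) - position)
        for size in range(longest, 0, -1):
            piece = clause[position:position + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                position += size
                break
    return tokens


# Toy lexicon and unspaced text standing in for a Chinese answer.
lexicon = {"team", "teamwork", "work", "matters"}
clauses = split_clauses("teamworkmatters,workteam")
segmented = [longest_word_first(clause, lexicon) for clause in clauses]
print(segmented)
```

Because the longest candidate is tried first, "teamwork" is preferred over the shorter entry "team" at the start of the first clause.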

The corresponding word segmentation library is constructed from the segmentation results. Its entries take a form such as: ("colleague", 4), ("opinion", 3), ("based on the actual situation", 1), ("communicate", 2). Here, ("colleague", 4) means that the token "colleague" has the number 4 in the word segmentation library.
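A word segmentation library of this kind is a mapping from each distinct token to an integer number. The sketch below assigns numbers in order of first appearance; that numbering rule, the function name, and the sample tokens are assumptions made for illustration, since the text does not say how numbers are assigned.

```python
from typing import Dict, Iterable, List


def build_segmentation_library(tokenized_answers: Iterable[List[str]]) -> Dict[str, int]:
    """Map every distinct token to an integer number, in order of first appearance."""
    token_to_id: Dict[str, int] = {}
    for tokens in tokenized_answers:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id


# Illustrative tokenized answers.
library = build_segmentation_library([
    ["colleague", "opinion", "communicate"],
    ["colleague", "listen"],
])
print(sorted(library.items(), key=lambda item: item[1]))
```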

The corresponding corpus is generated from the word segmentation library. The corpus records how many times each token appears in a Q&A content, that is, its term frequency. The corpus takes a form such as: [(0,2),(1,1),(2,1)],[(3,1),(4,1),(5,1)]. Each pair of square brackets represents one Q&A content, and the Q&A contents are separated by commas; (0,2) means that the token numbered 0 appears twice in that Q&A content.
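The conversion from a tokenized answer to the (token number, count) pairs shown above can be sketched like this. The function name and the decision to skip tokens missing from the library are assumptions; the sample is chosen so that the first answer reproduces the [(0,2),(1,1),(2,1)] shape from the text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def to_term_frequencies(tokens: List[str], token_to_id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Count how often each known token occurs and return sorted (token number, count) pairs.

    Assumption: tokens that are not in the library are ignored.
    """
    counts = Counter(token_to_id[token] for token in tokens if token in token_to_id)
    return sorted(counts.items())


# Illustrative library and answer.
token_to_id = {"colleague": 0, "opinion": 1, "communicate": 2}
answer = ["colleague", "opinion", "colleague", "communicate", "unknown"]
print(to_term_frequencies(answer, token_to_id))
```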

The term frequency-inverse document frequency (TF-IDF) model is constructed from the corpus. The TF-IDF model has two parts. The first is TF (term frequency), the number of times a token appears in one Q&A content. The second is IDF (inverse document frequency), which reflects how many Q&A contents the token appears in: the fewer Q&A contents contain the token, the higher its IDF. If a token has a high TF in one Q&A content and rarely appears in other Q&A contents, it is considered to discriminate well between categories and to be suitable for classification. The TF-IDF model takes a form such as: [(0, 0.1469), (1, 0.2842), (2, 0.2561), (3, 0.1528)], where (0, 0.1469) means that the token numbered 0 has an importance weight of 0.1469 for this Q&A content.
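To make the two parts concrete, the sketch below computes one common TF-IDF variant: relative term frequency multiplied by the natural logarithm of (number of answers / number of answers containing the token). The text does not specify which variant, logarithm base, or normalization is used, so these choices and the function name are assumptions, and the numbers produced will not match the example weights above.

```python
import math
from typing import Dict, List, Tuple

BagOfWords = List[Tuple[int, int]]  # (token number, count) pairs for one answer


def tf_idf(corpus: List[BagOfWords]) -> List[List[Tuple[int, float]]]:
    """Weight each token by (count / answer length) * ln(N / document frequency)."""
    answer_count = len(corpus)
    document_frequency: Dict[int, int] = {}
    for answer in corpus:
        for token_id, _ in answer:
            document_frequency[token_id] = document_frequency.get(token_id, 0) + 1

    weighted = []
    for answer in corpus:
        total = sum(count for _, count in answer)
        weighted.append([
            (token_id, (count / total) * math.log(answer_count / document_frequency[token_id]))
            for token_id, count in answer
        ])
    return weighted


# Illustrative corpus: token 1 occurs in both answers, so its weight is 0.
weights = tf_idf([[(0, 2), (1, 1)], [(1, 1), (2, 1)]])
print(weights)
```

A token that occurs in every answer gets weight 0 under this variant, which reflects the idea that such tokens do not help distinguish one answer from another.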

A latent Dirichlet allocation (LDA) model is constructed based on the TF-IDF model. LDA is a generative topic model for documents. The LDA model records the probability distribution of each question and answer content over the different topics, in the following form:

[(Topic 1), (Topic 2), (Topic 3)]

[(0,0.7188),(1,0.1550),(2,0.1260)]

[(0,0.2856),(1,0.6423),(2,0.0719)]

[(0,0.4189),(1,0.3004),(2,0.2806)]

Here, each row corresponds to one question and answer content (the first row to question and answer content one, the second to content two, and the third to content three), and in each pair the first number is the topic index (0 for topic one, 1 for topic two, 2 for topic three). For example, (0,0.7188) in the first row means that the probability that question and answer content one belongs to topic one is 0.7188. By comparison, question and answer content one and question and answer content three most probably belong to topic one, and question and answer content two most probably belongs to topic two.
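The comparison described above can be sketched in Python by taking, for each question and answer content, the topic with the largest probability. The sketch reads each row as one question and answer content and the first element of each pair as a zero-based topic index, an interpretation that is consistent with the stated comparison result; the function name `dominant_topic` is hypothetical.

```python
# One row per question and answer content; each pair is (topic index, probability).
doc_topics = [
    [(0, 0.7188), (1, 0.1550), (2, 0.1260)],  # content one
    [(0, 0.2856), (1, 0.6423), (2, 0.0719)],  # content two
    [(0, 0.4189), (1, 0.3004), (2, 0.2806)],  # content three
]


def dominant_topic(distribution):
    """Return the topic index with the highest probability."""
    return max(distribution, key=lambda pair: pair[1])[0]


dominant = [dominant_topic(row) for row in doc_topics]
print(dominant)
```

Topic index 0 stands for topic one and index 1 for topic two, so contents one and three fall under topic one and content two under topic two.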

Step S3: import the word segmentation database, the corpus, the TF-IDF model and the LDA model from the database; segment the question and answer content to be scored based on the word segmentation database; count the term frequencies of its segmented words based on the corpus; then input the term frequency statistics into the TF-IDF model and the LDA model in sequence, and obtain the output queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored.

When applied to a scoring scenario, the iteratively trained word segmentation database, corpus, TF-IDF model and LDA model described above are imported and applied together. The question and answer content to be scored can first be cleaned (in the same way as described above), then segmented based on the word segmentation database, and its term frequencies counted based on the corpus. The term frequency statistics are input into the trained TF-IDF model, and the output of the TF-IDF model is then input into the LDA model. The LDA model outputs the queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored.

Step S4: from the queue with the highest probability, select the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold as the similar queue.

For example, the queue with the highest probability is (3,0.8550), (4,0.6423), (7,0.9004), where 3 denotes question and answer content four, 4 denotes question and answer content five, and 7 denotes question and answer content eight. The probability of belonging to the same topic is used as the similarity, and the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold is selected. If the threshold is 0.85, question and answer content four and question and answer content eight form the similar queue of the question and answer content to be scored.
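The selection in step S4 amounts to a threshold filter, sketched below in Python with the values from the example. The function name `similar_queue` and the default threshold argument are hypothetical choices for this sketch.

```python
def similar_queue(queue, threshold=0.85):
    """Keep the entries whose similarity is at least the threshold.

    queue: list of (content index, similarity) pairs.
    """
    return [(doc_id, sim) for doc_id, sim in queue if sim >= threshold]


queue = [(3, 0.8550), (4, 0.6423), (7, 0.9004)]
selected = similar_queue(queue)
print(selected)
```

Indices 3 and 7 remain, which correspond to question and answer content four and question and answer content eight.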

Step S5: if the length of the similar queue is greater than or equal to 2, obtain the actual score corresponding to each question and answer content in the similar queue, and calculate the predicted score of the question and answer content to be scored based on those actual scores.

If the length of the similar queue is 1, the actual score of the single question and answer content in the similar queue is used as the predicted score of the question and answer content to be scored.

If the length of the similar queue is greater than or equal to 2, the predicted score of the question and answer content to be scored is calculated from the actual score corresponding to each question and answer content in the similar queue, as follows:

where $P_j$ is the predicted score of the question and answer content j to be scored; the mean term is the mean of the actual scores corresponding to all question and answer content in the similar queue; L is the length of the similar queue (an integer greater than or equal to 2); $\mathrm{Sim}(i,j)$ is the similarity between question and answer content i in the similar queue and the question and answer content j to be scored; and $r_i$ is the actual score corresponding to question and answer content i in the similar queue.
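The prediction formula itself is not reproduced in this text, so the Python sketch below should be read only as one plausible instance built from the listed quantities (the mean of the actual scores, L, Sim(i,j) and the actual scores r_i). It assumes a mean-centered, similarity-weighted form: the mean plus the similarity-weighted sum of deviations from the mean, divided by the sum of absolute similarities. This form, the function name `predict_score` and the sample scores are all assumptions and are not taken from the application.

```python
def predict_score(similar):
    """Predict a score from (similarity, actual score) pairs.

    ASSUMED formula (not confirmed by the source text):
        P_j = mean + sum(Sim(i,j) * (r_i - mean)) / sum(|Sim(i,j)|)
    A queue of length 1 returns that entry's actual score directly,
    as the description states.
    """
    if not similar:
        raise ValueError("similar queue is empty")
    if len(similar) == 1:
        return similar[0][1]
    mean = sum(score for _, score in similar) / len(similar)
    numerator = sum(sim * (score - mean) for sim, score in similar)
    denominator = sum(abs(sim) for sim, _ in similar)
    return mean + numerator / denominator


# Hypothetical actual scores for two entries of a similar queue.
p = predict_score([(0.8550, 80.0), (0.9004, 90.0)])
print(round(p, 4))
```

Under this assumed form, the prediction is pulled from the plain mean toward the scores of the more similar entries. Because every entry in the similar queue has a similarity at or above a positive threshold, the denominator is nonzero.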

Further, to evaluate the accuracy of the predicted scores, the actual scores of N question and answer contents to be scored can also be obtained, the mean absolute error (MAE) of the predicted scores of the N question and answer contents can be calculated from those actual scores, and the LDA model can be evaluated based on the mean absolute error:

$$\mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert P_j - r_j \rvert$$

where $P_j$ is the predicted score and $r_j$ is the actual score corresponding to the question and answer content j to be scored, and N is an integer greater than or equal to 2.

The closer the mean absolute error is to 0, the higher the accuracy of the predicted scores and the better the iterative training of the word segmentation database, corpus, TF-IDF model and LDA model described above.
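The evaluation step can be sketched in Python using the standard definition of the mean absolute error, that is, the average of the absolute differences between predicted and actual scores. The function name and the sample scores are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)


# Hypothetical predicted and actual scores for N = 3 contents.
mae = mean_absolute_error([85.0, 72.5, 90.0], [80.0, 75.0, 90.0])
print(mae)
```

A value closer to 0 indicates that the predicted scores track the actual scores more closely.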

Compared with the prior art, this application first constructs a word segmentation database, a corpus, a TF-IDF model and an LDA model from the large volume of existing question and answer content from written test sessions. Then, in application, the question and answer content to be scored is segmented based on the word segmentation database and its term frequencies are counted based on the corpus. Finally, the term frequency statistics are input into the TF-IDF model and the LDA model in sequence to obtain the output queue of historical question and answer content with the highest probability of belonging to the same topic as the content to be scored; the actual scores corresponding to the more similar entries in that queue are obtained, and the predicted score of the question and answer content to be scored is calculated from those actual scores. Because the models are obtained by repeated training on a large volume of question and answer content, this application removes the influence of a scorer's subjective thinking and preferences on the objectivity of the score, helps ensure objective and fair scoring, and saves time and effort.

The serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.

The above are only preferred embodiments of this application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, or applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.

Claims

A device for predicting the score of question and answer content, characterized in that the device comprises a memory and a processor connected to the memory, the memory storing a processing system executable on the processor, wherein the processing system, when executed by the processor, implements the following steps:

collecting historical question and answer content from a written test session and the actual score corresponding to each question and answer content;

constructing a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content, and saving them to a database;

importing the word segmentation database, the corpus, the TF-IDF model and the LDA model from the database, segmenting the question and answer content to be scored based on the word segmentation database, counting the term frequencies of the segmented words of the question and answer content to be scored based on the corpus, inputting the term frequency statistics into the TF-IDF model and the LDA model in sequence, and obtaining the output queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored;

selecting, from the queue with the highest probability, the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold as a similar queue;

if the length of the similar queue is greater than or equal to 2, obtaining the actual score corresponding to each question and answer content in the similar queue, and calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue.
The device for predicting the score of question and answer content according to claim 1, characterized in that the formula for calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue is:

where $P_j$ is the predicted score of the question and answer content j to be scored; the mean term is the mean of the actual scores corresponding to all question and answer content in the similar queue; L is the length of the similar queue; $\mathrm{Sim}(i,j)$ is the similarity between question and answer content i in the similar queue and the question and answer content j to be scored; and $r_i$ is the actual score corresponding to question and answer content i in the similar queue.
The device for predicting the score of question and answer content according to claim 2, characterized in that the processing system, when executed by the processor, further implements the following steps:

obtaining the actual scores of N question and answer contents to be scored, and calculating, based on the actual scores, the mean absolute error (MAE) of the predicted scores of the N question and answer contents to be scored:

$$\mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert P_j - r_j \rvert$$

where $r_j$ is the actual score corresponding to the question and answer content j to be scored, and N is an integer greater than or equal to 2;

analyzing the accuracy of the predicted scores based on the mean absolute error.
The device for predicting the score of question and answer content according to any one of claims 1 to 3, characterized in that the step of constructing a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content and saving them to a database specifically comprises:

segmenting each question and answer content using a predetermined word segmentation algorithm to obtain a word segmentation result for each question and answer content; constructing the corresponding word segmentation database based on the word segmentation results; generating the corresponding corpus based on the word segmentation database; constructing the TF-IDF model based on the corpus; constructing the LDA model based on the TF-IDF model; and saving the word segmentation database, the corpus, the TF-IDF model and the LDA model after each has been iteratively trained a predetermined number of times.
The device for predicting the score of question and answer content according to claim 4, characterized in that the predetermined word segmentation algorithm is a hidden Markov algorithm.
A method for predicting the score of question and answer content, characterized in that the method comprises:

S1, collecting historical question and answer content from a written test session and the actual score corresponding to each question and answer content;

S2, constructing a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content, and saving them to a database;

S3, importing the word segmentation database, the corpus, the TF-IDF model and the LDA model from the database, segmenting the question and answer content to be scored based on the word segmentation database, counting the term frequencies of the segmented words of the question and answer content to be scored based on the corpus, inputting the term frequency statistics into the TF-IDF model and the LDA model in sequence, and obtaining the output queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored;

S4, selecting, from the queue with the highest probability, the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold as a similar queue;

S5, if the length of the similar queue is greater than or equal to 2, obtaining the actual score corresponding to each question and answer content in the similar queue, and calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue.
The method for predicting the score of question and answer content according to claim 6, characterized in that the formula for calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue is:

where $P_j$ is the predicted score of the question and answer content j to be scored; the mean term is the mean of the actual scores corresponding to all question and answer content in the similar queue; L is the length of the similar queue; $\mathrm{Sim}(i,j)$ is the similarity between question and answer content i in the similar queue and the question and answer content j to be scored; and $r_i$ is the actual score corresponding to question and answer content i in the similar queue.
The method for predicting the score of question and answer content according to claim 7, characterized in that, after step S5, the method further comprises:

obtaining the actual scores of N question and answer contents to be scored, and calculating, based on the actual scores, the mean absolute error (MAE) of the predicted scores of the N question and answer contents to be scored:

$$\mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert P_j - r_j \rvert$$

where $r_j$ is the actual score corresponding to the question and answer content j to be scored, and N is an integer greater than or equal to 2;

analyzing the accuracy of the predicted scores based on the mean absolute error.
The method for predicting the score of question and answer content according to any one of claims 6 to 8, characterized in that step S2 specifically comprises:

segmenting each question and answer content using a predetermined word segmentation algorithm to obtain a word segmentation result for each question and answer content; constructing the corresponding word segmentation database based on the word segmentation results; generating the corresponding corpus based on the word segmentation database; constructing the term frequency-inverse document frequency (TF-IDF) model based on the corpus; constructing the latent Dirichlet allocation (LDA) model based on the TF-IDF model; and saving the word segmentation database, the corpus, the TF-IDF model and the LDA model after each has been iteratively trained a predetermined number of times.
The method for predicting the score of question and answer content according to claim 9, characterized in that the predetermined word segmentation algorithm is a hidden Markov algorithm.
A computer-readable storage medium, characterized in that a processing system is stored on the computer-readable storage medium, and the processing system, when executed by a processor, implements the following steps:

collecting historical question and answer content from a written test session and the actual score corresponding to each question and answer content;

constructing a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content, and saving them to a database;

importing the word segmentation database, the corpus, the TF-IDF model and the LDA model from the database, segmenting the question and answer content to be scored based on the word segmentation database, counting the term frequencies of the segmented words of the question and answer content to be scored based on the corpus, inputting the term frequency statistics into the TF-IDF model and the LDA model in sequence, and obtaining the output queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored;

selecting, from the queue with the highest probability, the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold as a similar queue;

if the length of the similar queue is greater than or equal to 2, obtaining the actual score corresponding to each question and answer content in the similar queue, and calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue.
The computer-readable storage medium according to claim 11, characterized in that the formula for calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue is:

where $P_j$ is the predicted score of the question and answer content j to be scored; the mean term is the mean of the actual scores corresponding to all question and answer content in the similar queue; L is the length of the similar queue; $\mathrm{Sim}(i,j)$ is the similarity between question and answer content i in the similar queue and the question and answer content j to be scored; and $r_i$ is the actual score corresponding to question and answer content i in the similar queue.
The computer-readable storage medium according to claim 12, characterized in that the processing system, when executed by the processor, further implements the following steps:

obtaining the actual scores of N question and answer contents to be scored, and calculating, based on the actual scores, the mean absolute error (MAE) of the predicted scores of the N question and answer contents to be scored:

$$\mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert P_j - r_j \rvert$$

where $r_j$ is the actual score corresponding to the question and answer content j to be scored, and N is an integer greater than or equal to 2;

analyzing the accuracy of the predicted scores based on the mean absolute error.
The computer-readable storage medium according to any one of claims 11 to 13, characterized in that the step of constructing a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content and saving them to a database specifically comprises:

segmenting each question and answer content using a predetermined word segmentation algorithm to obtain a word segmentation result for each question and answer content; constructing the corresponding word segmentation database based on the word segmentation results; generating the corresponding corpus based on the word segmentation database; constructing the TF-IDF model based on the corpus; constructing the LDA model based on the TF-IDF model; and saving the word segmentation database, the corpus, the TF-IDF model and the LDA model after each has been iteratively trained a predetermined number of times.
The computer-readable storage medium according to claim 14, characterized in that the predetermined word segmentation algorithm is a hidden Markov algorithm.
A processing system, characterized in that the processing system comprises a collection module, a construction module, an output module, a selection module and a prediction scoring module;

the collection module is configured to collect historical question and answer content from a written test session and the actual score corresponding to each question and answer content;

the construction module is configured to construct a word segmentation database, a corpus, a term frequency-inverse document frequency (TF-IDF) model and a latent Dirichlet allocation (LDA) model based on the question and answer content, and to save them to a database;

the output module is configured to import the word segmentation database, the corpus, the TF-IDF model and the LDA model from the database, segment the question and answer content to be scored based on the word segmentation database, count the term frequencies of the segmented words of the question and answer content to be scored based on the corpus, input the term frequency statistics into the TF-IDF model and the LDA model in sequence, and obtain the output queue of historical question and answer content with the highest probability of belonging to the same topic as the question and answer content to be scored;

the selection module is configured to select, from the queue with the highest probability, the question and answer content whose similarity to the question and answer content to be scored is greater than or equal to a predetermined threshold as a similar queue;

the prediction scoring module is configured to, if the length of the similar queue is greater than or equal to 2, obtain the actual score corresponding to each question and answer content in the similar queue, and calculate the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue.
The processing system according to claim 16, characterized in that the formula for calculating the predicted score of the question and answer content to be scored based on the actual score corresponding to each question and answer content in the similar queue is:

where $P_j$ is the predicted score of the question and answer content j to be scored; the mean term is the mean of the actual scores corresponding to all question and answer content in the similar queue; L is the length of the similar queue; $\mathrm{Sim}(i,j)$ is the similarity between question and answer content i in the similar queue and the question and answer content j to be scored; and $r_i$ is the actual score corresponding to question and answer content i in the similar queue.
The processing system according to claim 17, characterized in that the processing system, when executed by the processor, further implements the following steps:

obtaining the actual scores of N question and answer contents to be scored, and calculating, based on the actual scores, the mean absolute error (MAE) of the predicted scores of the N question and answer contents to be scored:

$$\mathrm{MAE} = \frac{1}{N}\sum_{j=1}^{N} \lvert P_j - r_j \rvert$$

where $r_j$ is the actual score corresponding to the question and answer content j to be scored, and N is an integer greater than or equal to 2;

analyzing the accuracy of the predicted scores based on the mean absolute error.
The processing system according to any one of claims 16 to 18, characterized in that the construction module is specifically configured for:

segmenting each question and answer content using a predetermined word segmentation algorithm to obtain a word segmentation result for each question and answer content; constructing the corresponding word segmentation database based on the word segmentation results; generating the corresponding corpus based on the word segmentation database; constructing the term frequency-inverse document frequency (TF-IDF) model based on the corpus; constructing the latent Dirichlet allocation (LDA) model based on the TF-IDF model; and saving the word segmentation database, the corpus, the TF-IDF model and the LDA model after each has been iteratively trained a predetermined number of times.
The processing system according to claim 19, characterized in that the predetermined word segmentation algorithm is a hidden Markov algorithm.