CN111930792A - Data resource labeling method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN111930792A
Authority
CN
China
Prior art keywords
target
knowledge point
word
vocabulary
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010580828.4A
Other languages
Chinese (zh)
Other versions
CN111930792B (en)
Inventor
胡科
包英泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202010580828.4A priority Critical patent/CN111930792B/en
Publication of CN111930792A publication Critical patent/CN111930792A/en
Application granted granted Critical
Publication of CN111930792B publication Critical patent/CN111930792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a data resource labeling method and device, a storage medium and electronic equipment, and belongs to the technical field of computers. The method comprises the following steps: the server preprocesses an original data resource to obtain text data; similarity calculation is performed between the text data and a plurality of target knowledge points to obtain similarity values; a basic knowledge point label set of the original data resource is generated according to the comparison results of the similarity values and a similarity threshold; and a comprehensive knowledge point label set of the original data resource is generated according to the characteristic information of the original data resource and the basic knowledge point label set. In this way, the original data resource can be quickly and accurately labeled with its related knowledge point labels, improving labeling efficiency and labeling accuracy.

Description

Data resource labeling method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for tagging data resources, a storage medium, and an electronic device.
Background
With the development of the internet, data plays an increasingly important role in the internet industry: retail, transportation, social networking, search, education, medical and other industries all involve large-scale data analysis. Taking online education as an example, in an online education scenario a worker usually needs to analyze a user's teaching data to learn about the user's teaching and learning situation, so as to provide better service for the user later. Analyzing the user's learning data requires obtaining the knowledge point labels associated with the data resources the user has studied, and similar application scenarios are common in other fields. However, in the related art, the knowledge point labels associated with data resources are usually labeled in advance by hand; this manual labeling method is inefficient and affected by the subjective factors of the labeler, so knowledge point labels cannot be labeled on data resources accurately.
Disclosure of Invention
The embodiment of the application provides a data resource labeling method, a data resource labeling device, a storage medium and electronic equipment, and can solve the problems of inaccuracy and low efficiency in labeling knowledge points of data resources in the related art.
The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for annotating a data resource, where the method includes:
preprocessing original data resources to obtain text data;
respectively carrying out similarity calculation on the text data and a plurality of target knowledge points to obtain similarity values; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;
generating a basic knowledge point label set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; wherein, the basic knowledge point labels included in the basic knowledge point label set are: basic knowledge point labels associated with the target knowledge points with similarity values larger than a similarity threshold value;
and generating a comprehensive knowledge point label set of the original data resources according to the characteristic information of the original data resources and the basic knowledge point label set.
In a second aspect, an embodiment of the present application provides a device for annotating a data resource, where the device for annotating a data resource includes:
the preprocessing module is used for preprocessing the original data resource to obtain text data;
the calculation module is used for carrying out similarity calculation on the text data and the target knowledge points respectively to obtain similarity values; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;
the first processing module is used for generating a basic knowledge point label set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; wherein, the basic knowledge point labels included in the basic knowledge point label set are: basic knowledge point labels associated with the target knowledge points with similarity values larger than a similarity threshold value;
and the second processing module is used for generating a comprehensive knowledge point label set of the original data resources according to the characteristic information of the original data resources and the basic knowledge point label set.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
when the scheme of the embodiment of the application is executed, the server preprocesses the original data resource to obtain the text data, the text data and the target knowledge points are subjected to similarity calculation to obtain the similarity values, the basic knowledge point label set of the original data resource is generated according to the comparison result of the similarity values and the similarity threshold values, the comprehensive knowledge point label set of the original data resource is generated according to the characteristic information of the original data resource and the basic knowledge point label set, the knowledge point labels related to the original data resource can be labeled on the original data resource quickly and accurately, and the labeling efficiency and the labeling accuracy are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a diagram of a system architecture provided by an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for tagging data resources according to an embodiment of the present application;
FIG. 3 is another schematic flow chart of a method for annotating data resources according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a similarity calculation flow of a tagging method for data resources according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an exemplary system architecture 100 to which the annotation method for data resources or the annotation apparatus for data resources of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is the medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links or fiber optic cables. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as: video recording applications, video playing applications, voice interaction applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to smart phones, tablet computers, portable computers, desktop computers, and the like. The network 104 may include various types of wired or wireless communication links, for example: the wired communication links include optical fiber, twisted pair or coaxial cable, and the wireless communication links include Bluetooth communication links, Wireless Fidelity (Wi-Fi) communication links, microwave communication links, and the like. The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module, which is not particularly limited herein. When the terminal devices 101, 102, and 103 are hardware, they may further include a display device and a camera; the display device may be any device capable of implementing a display function, and the camera is used to collect a video stream. For example, the display device may be a Cathode Ray Tube (CRT) display, a Light-Emitting Diode (LED) display, an electronic ink screen, a Liquid Crystal Display (LCD), a Plasma Display Panel (PDP), or the like. The user can view information such as displayed text, pictures and videos using the display device on the terminal devices 101, 102, 103.
It should be noted that, the annotation method for data resources provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the annotation device for data resources is generally disposed in the server 105. The server 105 may be a server that provides various services, and the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as a plurality of software or software modules (for example, for providing distributed services), or may be implemented as a single software or software module, and is not limited in particular herein.
The server 105 in the present application may be a terminal device providing various services, such as: the server preprocesses the original data resources to obtain text data, similarity calculation is carried out on the text data and the target knowledge points respectively to obtain similarity values, a basic knowledge point label set of the original data resources is generated according to the comparison results of the similarity values and the similarity threshold values, and a comprehensive knowledge point label set of the original data resources is generated according to the feature information of the original data resources and the basic knowledge point label set.
It should be noted that, the annotation method for the data resource provided in the embodiment of the present application may be executed by one or more of the terminal devices 101, 102, and 103, and/or the server 105, and accordingly, the annotation device for the data resource provided in the embodiment of the present application is generally disposed in the corresponding terminal device, and/or the server 105, but the present application is not limited thereto.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The following describes in detail a method for tagging data resources according to an embodiment of the present application with reference to fig. 2 to 4. It should be noted that, for convenience of description, the embodiment is described by taking the online education industry as an example, but those skilled in the art understand that the application of the present application is not limited to the online education industry, and the method for tagging data resources described in the present application can be effectively applied to various industries of the internet.
Referring to fig. 2, a flow chart of a method for tagging data resources is provided in an embodiment of the present application. As shown in fig. 2, the method of the embodiment of the present application may include the steps of:
s201, preprocessing the original data resource to obtain text data.
The original data resources refer to data resources in the form of text, audio, video and the like, and may include data resources such as exercises, picture books, learning audio and learning video. An original data resource also carries the learning level and subject corresponding to it. The text data refers to data obtained by uniformly converting original data resources of the text, audio, video and other types into a text type.
Generally, when the original data resource is of an audio or video type, it may be converted into text data of a preset text type by an ASR (Automatic Speech Recognition) technique. The ASR technique converts audio into text based on a keyword list: the audio (or the audio in a video) is converted into speech features through its spectrum, the speech features are matched with the entries in the keyword list, and the best matching result is taken as the recognition result. When the original data resource is of a text type but not the preset text type, it needs to be converted into the preset text type. Common text types include txt, doc, hlp, wps, rtf, htm, pdf and the like, and the preset text type can be set to different text types according to actual needs.
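As an illustration of step S201, a minimal Python sketch of one possible preprocessing flow is given below. The transcribe_audio and convert_to_preset_text_type helpers are hypothetical placeholders (the embodiment does not name a specific ASR engine or document converter), and txt is assumed here as the preset text type.

    import os

    AUDIO_VIDEO_EXTENSIONS = {".mp3", ".wav", ".mp4", ".avi"}
    OTHER_TEXT_EXTENSIONS = {".doc", ".hlp", ".wps", ".rtf", ".htm", ".pdf"}

    def transcribe_audio(path):
        # Hypothetical ASR call: convert the audio (or the audio track of a video) into
        # speech features and match them against a keyword list, returning the best match.
        raise NotImplementedError("plug in an ASR engine here")

    def convert_to_preset_text_type(path):
        # Hypothetical converter from doc/pdf/etc. to the preset text type (txt is assumed here).
        raise NotImplementedError("plug in a document converter here")

    def preprocess(raw_resource_path):
        """Convert a raw data resource (text/audio/video) into text data."""
        ext = os.path.splitext(raw_resource_path)[1].lower()
        if ext in AUDIO_VIDEO_EXTENSIONS:
            return transcribe_audio(raw_resource_path)
        if ext == ".txt":
            with open(raw_resource_path, encoding="utf-8") as f:
                return f.read()                        # already the preset text type
        if ext in OTHER_TEXT_EXTENSIONS:
            return convert_to_preset_text_type(raw_resource_path)
        raise ValueError("unsupported resource type: " + ext)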
And S202, respectively carrying out similarity calculation on the text data and the target knowledge points to obtain similarity values.
The target knowledge points comprise at least one of target content vocabularies, target high-frequency vocabularies, target verb vocabularies, target mathematical vocabularies, target phonetic symbols, target sentence patterns and target grammars, and are knowledge points that can be acquired from a knowledge graph; different learning levels correspond to different target knowledge points, and each of the plurality of target knowledge points is associated with a basic knowledge point label. The similarity value refers to the similarity relationship between the two quantities being compared; generally, a greater similarity value indicates greater similarity between the two quantities.
Generally, before the similarity calculation between the text data and the plurality of target knowledge points is performed to obtain similarity values, the course information corresponding to the original data resource needs to be obtained, and the preset knowledge graph is queried with the course information to obtain the plurality of target knowledge points corresponding to the course information; the knowledge graph comprises target knowledge points of different learning levels and different years. Performing similarity calculation between the text data and the plurality of target knowledge points means calculating similarity values between the basic knowledge points in the text data and the target knowledge points; based on the similarity values, the basic knowledge point labels corresponding to the text data, that is, to the original data resource, can be determined. The basic knowledge points in the text data may include one or more of a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary, a reference mathematical vocabulary, a reference phonetic symbol, a reference sentence pattern and a reference grammar.
S203, generating a basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold value.
The similarity threshold refers to the lowest value that a similarity value must reach to satisfy the condition. The basic knowledge point label set is a set containing the basic knowledge point labels corresponding to the original data resource, and may include knowledge point labels respectively associated with a target content vocabulary, a target high-frequency vocabulary, a target verb vocabulary, a target mathematical vocabulary, a target phonetic symbol, a target sentence pattern and a target grammar; the basic knowledge point labels included in the basic knowledge point label set are the basic knowledge point labels associated with the target knowledge points whose similarity values are greater than the similarity threshold.
Generally, depending on the basic knowledge point and on the similarity calculation method, the similarity value, the similarity threshold and the basic knowledge point label corresponding to each basic knowledge point differ. When the target knowledge point is a target content vocabulary, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing chunking on the sentence set to obtain a word block set; extracting the importance degree weight of each word block in the word block set based on the keyword-extraction TF-IDF algorithm; taking the word blocks whose importance degree weight is greater than a first preset weight as reference content vocabulary; calculating the similarity value between the reference content vocabulary and the target content vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target content vocabulary and adding it to the basic knowledge point label set.
When the target knowledge point is a target high-frequency vocabulary, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing chunking on the sentence set to obtain a word block set; extracting the importance degree weight of each word block in the word block set based on the keyword-extraction TF-IDF algorithm; taking the word blocks whose importance degree weight is less than or equal to a second preset weight as reference high-frequency vocabulary; calculating the similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target high-frequency vocabulary and adding it to the basic knowledge point label set.
When the target knowledge point is a target verb vocabulary, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set; taking the words whose part of speech is a verb as reference verb vocabulary; calculating the similarity value between the reference verb vocabulary and the target verb vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target verb vocabulary and adding it to the basic knowledge point label set.
When the target knowledge point is a target mathematical vocabulary, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set; taking the words whose part of speech is a numeral as reference mathematical vocabulary; calculating the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target mathematical vocabulary and adding it to the basic knowledge point label set.
When the target knowledge point is a target phonetic symbol, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; analyzing each word in the word set and attaching a phonetic symbol to each word to obtain a phonetic symbol set; calculating the similarity values between the word phonetic symbols in the phonetic symbol set and the target phonetic symbol; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target phonetic symbol and adding it to the basic knowledge point label set.
When the target knowledge point is a target sentence pattern, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set; performing dependency syntactic analysis on the sentences in the sentence set to obtain a dependency syntax tree; calculating the similarity values between the words in the word set, the parts of speech in the part-of-speech tagging set and the dependency syntax tree on the one hand, and the corresponding words, parts of speech and syntax tree of the target sentence pattern on the other; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target sentence pattern and adding it to the basic knowledge point label set.
When the target knowledge point is a target grammar, generating the basic knowledge point label set of the original data resource according to the comparison result of the similarity value and the similarity threshold comprises: performing sentence segmentation on the text data to obtain a sentence set; performing word segmentation on the sentence set to obtain a word set; performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set; calculating, based on the word set and the part-of-speech tagging set, the similarity value between the grammar contained in the sentence set and the target grammar; and, when the similarity value is greater than the similarity threshold, obtaining the basic knowledge point label associated with the target grammar and adding it to the basic knowledge point label set.
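A minimal sketch of the sentence-pattern comparison described above is given below, assuming spaCy is used for tokenization, part-of-speech tagging and dependency syntactic analysis; the embodiment does not prescribe a particular parser, and the target pattern format, the overlap score and the threshold are illustrative assumptions.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # small English model, assumed to be installed

    def sentence_pattern_features(sentence):
        """(word, part of speech, dependency relation) triples for one sentence."""
        return [(tok.text.lower(), tok.pos_, tok.dep_) for tok in nlp(sentence)]

    def pattern_similarity(sentence, target_pattern):
        """Fraction of the target (part of speech, dependency) pairs found in the sentence.

        target_pattern is an illustrative representation of a target sentence pattern,
        e.g. [("PRON", "nsubj"), ("VERB", "ROOT"), ("NOUN", "dobj")] for a simple
        subject-verb-object pattern.
        """
        found = {(pos, dep) for _, pos, dep in sentence_pattern_features(sentence)}
        hits = sum(1 for pair in target_pattern if pair in found)
        return hits / len(target_pattern) if target_pattern else 0.0

    svo_pattern = [("PRON", "nsubj"), ("VERB", "ROOT"), ("NOUN", "dobj")]
    if pattern_similarity("I like apples", svo_pattern) > 0.8:   # illustrative threshold
        print("attach the basic knowledge point label associated with this target sentence pattern")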
And S204, generating a comprehensive knowledge point label set of the original data resources according to the characteristic information of the original data resources and the basic knowledge point label set.
The characteristic information refers to the type of the original data resource; for example, the type may be audio, video, text, picture book and the like, and different types train or exercise different learning abilities (for example, listening, speaking, reading and writing). The comprehensive knowledge point labels in the comprehensive knowledge point label set are knowledge point labels, generated based on the characteristic information of the original data resource and the basic knowledge point label set, that reflect which of the user's (student's) abilities the original data resource exercises; an original data resource may correspond to a plurality of comprehensive knowledge point labels. For example, if the original data resource contains audio and the basic knowledge point label set contains knowledge point labels associated with target phonetic symbols, it can be deduced that a comprehensive knowledge point label corresponding to the original data resource is a listening label.
Generally, the comprehensive knowledge point labels of an original data resource are related to the characteristics of that resource. Both the comprehensive knowledge point labels in the comprehensive knowledge point label set and the basic knowledge point labels in the basic knowledge point label set are associated with the original data resource; once the basic knowledge point label set and the comprehensive knowledge point label set have been generated, the original data resource has been labeled with its related knowledge point labels (including the basic knowledge point labels and the comprehensive knowledge point labels).
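One possible reading of step S204 is a rule-based mapping from the resource type (the characteristic information) plus the basic knowledge point labels to ability-oriented comprehensive labels. The sketch below only reproduces the audio-plus-phonetic-symbol example given above; the rule table and label names are illustrative assumptions, not the embodiment's actual rules.

    def comprehensive_labels(resource_type, basic_label_set):
        """Derive comprehensive (ability) labels from the characteristic information and the basic labels.

        resource_type:   e.g. "audio", "video", "text", "picture_book"
        basic_label_set: the basic knowledge point labels already attached to the resource
        The rules below are illustrative, not the embodiment's actual rule set.
        """
        result = set()
        has_phonetic_label = any("phonetic" in label for label in basic_label_set)
        if resource_type in ("audio", "video") and has_phonetic_label:
            result.add("listening")        # the example given in the description
        if resource_type in ("text", "picture_book"):
            result.add("reading")          # assumed analogous rule
        return result

    print(comprehensive_labels("audio", {"phonetic:IY1", "vocab:apple"}))   # -> {'listening'}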
When the scheme of the embodiment of the application is executed, the server preprocesses an original data resource to obtain text data; similarity calculation is performed between the text data and a plurality of target knowledge points to obtain similarity values; a basic knowledge point label set of the original data resource is generated according to the comparison results of the similarity values and a similarity threshold; and a comprehensive knowledge point label set of the original data resource is generated according to the characteristic information of the original data resource and the basic knowledge point label set. In this way, the knowledge point labels related to the original data resource can be labeled on it quickly and accurately, improving labeling efficiency and labeling accuracy.
As described above, the embodiments of the present application have been described mainly in the online education industry, but those skilled in the art will understand that the application of the method is not limited to the online education industry, and the method described in the present application can be applied to user tag processing in various industries such as retail, transportation, social, search, education, medical, etc.
Referring to fig. 3, a schematic flow chart of a method for tagging data resources is provided in an embodiment of the present application, where the method for tagging data resources includes the following steps:
s301, preprocessing the original data resource to obtain text data.
The original data resource refers to a data resource existing as text, audio, video or another type, and may include, for example, teaching resources such as exercises, picture books, learning audio and learning video; an original data resource also carries the learning level and subject corresponding to it. The text data refers to data obtained by uniformly converting original data resources of the text, audio, video and other types into a text type. The basic knowledge points in the text data may include one or more of a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary, a reference mathematical vocabulary, a reference phonetic symbol, a reference sentence pattern and a reference grammar.
Generally, when the original data resource is of an audio or video type, it may be converted into text data of a preset text type by an ASR (Automatic Speech Recognition) technique. The ASR technique converts audio into text based on a keyword list: the audio (or the audio in a video) is converted into speech features through its spectrum, the speech features are matched with the entries in the keyword list, and the best matching result is taken as the recognition result. When the original data resource is of a text type but not the preset text type, it needs to be converted into the preset text type. Common text types include txt, doc, hlp, wps, rtf, htm, pdf and the like, and the preset text type can be set to different text types according to actual needs.
S302, querying a preset knowledge graph with the attribute information corresponding to the original data resource to obtain a plurality of target knowledge points corresponding to the attribute information.
Here the original data resource is a teaching resource and the attribute information is course information, that is, information related to the course, such as the learning level and learning subject corresponding to the original data resource. The target knowledge points comprise at least one of target content vocabularies, target high-frequency vocabularies, target verb vocabularies, target mathematical vocabularies, target phonetic symbols, target sentence patterns and target grammars, and are knowledge points that can be acquired from the knowledge graph; different learning levels correspond to different target knowledge points, and each of the plurality of target knowledge points is associated with a basic knowledge point label. The knowledge graph contains target knowledge points of different learning levels, different years and different learning subjects. A knowledge graph can also be understood as a knowledge domain visualization or knowledge domain map: a family of graphic representations that display the development process and structure of knowledge, describe knowledge resources and their carriers using visualization technology, and mine, analyze, construct, draw and display knowledge and the interrelations among knowledge resources and their carriers.
Generally, before the similarity calculation between the text data and the plurality of target knowledge points is performed to obtain similarity values, the course information corresponding to the original data resource needs to be obtained, and the preset knowledge graph is queried with the course information to obtain the plurality of target knowledge points corresponding to the course information; the knowledge graph contains target knowledge points of different learning levels, different years and different learning subjects. Performing similarity calculation between the text data and the plurality of target knowledge points means calculating similarity values between the basic knowledge points in the text data and the target knowledge points; based on the similarity values, the basic knowledge point labels corresponding to the text data, that is, to the original data resource, can be determined. Different original data resources correspond to different course information, and the plurality of target knowledge points corresponding to an original data resource can be queried from the preset knowledge graph based on its course information. However, the original data resource may not contain all of the target knowledge points queried from the preset knowledge graph; it may contain only one or some of them. Therefore, the similarity values between the basic knowledge points in the original data resource and the target knowledge points in the preset knowledge graph need to be analyzed, the target knowledge points contained in the original data resource can be determined based on these similarity values, and the basic knowledge point labels that can be associated with the original data resource can then be determined.
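A minimal sketch of the knowledge graph query in step S302, with the preset knowledge graph flattened into a dictionary keyed by learning level and subject; a real deployment would query a graph store, and the keys and target knowledge points shown are made-up examples.

    # Flattened stand-in for the preset knowledge graph:
    # (learning level, subject) -> target knowledge points (all values are made-up examples).
    KNOWLEDGE_GRAPH = {
        ("level_1", "english"): {
            "target_content_vocabulary": ["apple", "banana"],
            "target_high_frequency_vocabulary": ["the", "is"],
            "target_verb_vocabulary": ["eat", "like"],
            "target_phonetic_symbols": ["IY1", "AE1"],
        },
        # ... further (learning level, subject) entries
    }

    def query_target_knowledge_points(course_info):
        """course_info: dict carrying the resource's learning level and subject."""
        key = (course_info["learning_level"], course_info["subject"])
        return KNOWLEDGE_GRAPH.get(key, {})

    targets = query_target_knowledge_points({"learning_level": "level_1", "subject": "english"})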
S303, carrying out sentence segmentation processing on the text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set.
The sentence set refers to a set containing a plurality of sentences obtained by sentence segmentation of the text data; the sentences in the sentence set are complete sentences obtained by segmenting according to the text content, line feed characters, punctuation marks and the like of the text data. The word block set is a set containing a plurality of phrases (word blocks) obtained by dividing each sentence in the sentence set into phrases. The word set is a set containing a plurality of words obtained by dividing each sentence in the sentence set into words.
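The sketch below shows one way to produce the sentence set, the word block set and the word set of step S303, assuming NLTK's tokenizers and a simple noun-phrase chunk grammar; the embodiment does not prescribe a particular tokenizer or chunking rule.

    import nltk
    # Assumed one-time downloads: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

    def split_sentences(text_data):
        """Sentence set {S1, S2, ... Sn}."""
        return nltk.sent_tokenize(text_data)

    def split_words(sentence):
        """Word set for one sentence."""
        return nltk.word_tokenize(sentence)

    def split_word_blocks(sentence):
        """Word block (chunk) set for one sentence; the noun-phrase grammar is illustrative."""
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
        tree = chunker.parse(tagged)
        return [" ".join(word for word, _ in subtree.leaves())
                for subtree in tree.subtrees(lambda t: t.label() == "NP")]

    text = "The little red hen found a grain of wheat. She planted it."
    sentence_set = split_sentences(text)
    word_block_sets = [split_word_blocks(s) for s in sentence_set]   # {B11, ...}, {B21, ...}
    word_sets = [split_words(s) for s in sentence_set]               # {T11, ...}, {T21, ...}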
S304, calculating the importance degree weight of each word block in the word block set based on the keyword-extraction TF-IDF algorithm.
The importance degree weight, namely the TF-IDF value, is the weight of the importance degree of each word block obtained by analyzing the word block with the keyword-extraction TF-IDF algorithm; a weight here refers to the relative importance of an item in a weighted average and is also called a weighting value.
In general, TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method for evaluating how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. TF denotes the term frequency, that is, the number of times a term appears in the document divided by the total number of terms in the document, and IDF denotes the inverse document frequency of the word block, which acts as a weight. The main idea of IDF is: if fewer documents contain the term t, that is, n is smaller, then IDF is larger and the term t has good category-distinguishing ability. If the number of documents of a certain class C that contain the term t is m, and the number of documents of the other classes that contain t is k, then the total number of documents containing t is n = m + k; when m is large, n is also large, the IDF value obtained from the IDF formula is small, and the category-distinguishing ability of the term t is not strong.
The calculation process of the keyword-extraction TF-IDF algorithm mainly comprises: calculating the term frequency and, so that texts of different lengths can be compared, normalizing the term frequency of the word blocks; calculating the inverse document frequency, using a corpus to simulate the environment in which the language is used, where the more frequently a word appears, the larger the denominator and the smaller the inverse document frequency (closer to 0), with 1 added to the denominator to avoid a zero denominator (that is, the case where no document contains the word); and calculating the TF-IDF value, which is proportional to the number of occurrences of a word in the text and inversely proportional to its number of occurrences in the whole corpus. After the TF-IDF values of all words in the text are calculated, the words can be sorted in descending order of TF-IDF value.
S305, taking the word block with the importance degree weight larger than the first preset weight as a reference content word, and taking the word block with the importance degree weight smaller than or equal to the second preset weight as a reference high-frequency word.
The first preset weight is the basis for screening the reference content vocabulary and can be set according to actual needs. Reference content vocabulary refers to words that carry a definite meaning. The second preset weight is the basis for screening the reference high-frequency vocabulary; since the reference content vocabulary is taken from the word blocks with the highest importance degree weights and the reference high-frequency vocabulary from those with the lowest, the first preset weight is usually larger than the second preset weight.
Generally, after the importance degree of each word block in the word block set is scored based on the keyword-extraction TF-IDF algorithm, the word blocks can be sorted in descending order of importance degree weight; the word blocks in the front 1/3 of the descending order are taken as reference content vocabulary, and the word blocks in the rear 1/3 are taken as reference high-frequency vocabulary.
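The following sketch combines steps S304 and S305: every word block is scored with a hand-rolled keyword-extraction TF-IDF (TF multiplied by IDF, with 1 added to the IDF denominator as described above), the blocks are sorted in descending order of importance degree weight, and the front third is kept as reference content vocabulary and the rear third as reference high-frequency vocabulary. The exact TF-IDF variant and the toy input are illustrative assumptions.

    import math
    from collections import Counter

    def tfidf_weights(chunk_sets):
        """chunk_sets: one list of word blocks per sentence.

        TF  = occurrences of a word block / total word blocks in the text
        IDF = log(number of sentences / (number of sentences containing the block + 1))
        TF-IDF = TF * IDF; the +1 avoids a zero denominator, as described in S304.
        """
        all_blocks = [b.lower() for blocks in chunk_sets for b in blocks]
        tf = Counter(all_blocks)
        total = len(all_blocks)
        n_sentences = len(chunk_sets)
        df = Counter()
        for blocks in chunk_sets:
            for b in set(x.lower() for x in blocks):
                df[b] += 1
        return {b: (tf[b] / total) * math.log(n_sentences / (df[b] + 1)) for b in tf}

    def split_by_importance(chunk_sets):
        """Front 1/3 (highest weights) -> reference content vocabulary,
        rear 1/3 (lowest weights) -> reference high-frequency vocabulary."""
        weights = tfidf_weights(chunk_sets)
        ranked = sorted(weights, key=weights.get, reverse=True)   # descending importance degree weight
        third = max(1, len(ranked) // 3)
        return ranked[:third], ranked[-third:]

    chunk_sets = [["hen", "grain", "wheat"], ["hen", "bread"], ["hen"]]   # toy word block sets
    content_vocab, high_freq_vocab = split_by_importance(chunk_sets)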
S306, calculating the similarity value of the reference content vocabulary and the target content vocabulary and the similarity value of the reference high-frequency vocabulary and the target high-frequency vocabulary.
The similarity value refers to a similarity relationship between two quantities to be compared, and generally, the greater the similarity value, the more similar the two quantities, where the similarity value may be a similarity value between the reference content vocabulary and the target content vocabulary and a similarity value between the reference high-frequency vocabulary and the target high-frequency vocabulary. The target content vocabulary and the target high-frequency vocabulary are both target knowledge points corresponding to the course information of the original data resources in the knowledge graph, and the reference content vocabulary and the reference high-frequency vocabulary are basic knowledge points in the text data of the original data resources.
S307, when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target content vocabulary and a basic knowledge point label associated with the target high-frequency vocabulary.
The similarity threshold is the lowest value that the similarity value must reach to satisfy the condition. Depending on the basic knowledge point, the corresponding similarity thresholds may differ; the similarity threshold corresponding to the reference content vocabulary is different from the similarity threshold corresponding to the reference high-frequency vocabulary. The basic knowledge point labels are associated with the target knowledge points in the knowledge graph; if the basic knowledge points of the original data resource include a basic knowledge point whose similarity value with a target knowledge point in the knowledge graph is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data used to describe the characteristics of data resources, and different data resources correspond to different label data. Labels can effectively represent the knowledge points, learning content or learning abilities related to a data resource, and labeling data resources with different labels makes it possible to screen and analyze them.
By way of example: the sentence set { S1, S2 … Sn } obtained by sentence segmentation is chunked (Chunk) to obtain the word block sets { B11, B12 … B1m1}, { B21, B22 … } … corresponding to each sentence. Since most reference content vocabulary consists of subject-related words, in order to speed up the calculation the importance of each word block set { B11, B12 … B1m1}, { B21, B22 … } … is scored with the keyword-extraction TF-IDF algorithm to obtain the importance degree weight of each word block. After sorting the weights in descending order, the word blocks in the front 1/3 are taken as reference content vocabulary, that is, the word blocks whose importance degree weight is greater than the first preset weight are taken as reference content vocabulary, and similarity calculation is performed between them and the target content vocabulary in turn. Denote a reference content vocabulary item as Bi and a target content vocabulary item as Kj; the edit distance similarity sim_raw1 of Bi and Kj, the edit distance similarity sim_lemma1 after lemmatization, and the semantic similarity sim_sem1 are calculated respectively, and a total similarity value voc_score1 = voc_α·sim_raw1 + voc_β·sim_lemma1 + voc_γ·sim_sem1 is calculated from these similarities, where voc_α, voc_β and voc_γ are weighting coefficients. If the total similarity value is higher than a first preset similarity threshold voc_score_threshold1, the reference content vocabulary Bi is labeled with the knowledge point label associated with the target content vocabulary Kj.
By way of example: the sentence set { S1, S2 … Sn } obtained by sentence segmentation is chunked (Chunk) to obtain the word block sets { B11, B12 … B1m1}, { B21, B22 … } … corresponding to each sentence. The importance of each word block set { B11, B12 … B1m1}, { B21, B22 … } … is scored with the keyword-extraction TF-IDF algorithm to obtain the importance degree weight of each word block, and the weights are sorted in descending order; the word blocks in the rear 1/3 are taken as reference high-frequency vocabulary, and similarity calculation is performed between them and the target high-frequency vocabulary in turn. Denote a reference high-frequency vocabulary item as Ba and a target high-frequency vocabulary item as Kb; the edit distance similarity sim_raw2 of Ba and Kb, the edit distance similarity sim_lemma2 after lemmatization, and the semantic similarity sim_sem2 are calculated respectively, and a total similarity value voc_score2 = voc_α·sim_raw2 + voc_β·sim_lemma2 + voc_γ·sim_sem2 is calculated from these similarities. If the total similarity value is higher than a second preset similarity threshold voc_score_threshold2, the reference high-frequency vocabulary Ba is labeled with the knowledge point label associated with the target high-frequency vocabulary Kb.
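The two worked examples above combine an edit distance similarity, an edit distance similarity after lemmatization and a semantic similarity into one weighted score. The sketch below mirrors that combination, using difflib for the edit-distance-style similarity and NLTK's WordNetLemmatizer for lemmatization; the semantic similarity is left as a pluggable placeholder (for example, cosine similarity of word embeddings), and the coefficient values and threshold are illustrative, since the embodiment does not fix them.

    from difflib import SequenceMatcher
    from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")

    lemmatizer = WordNetLemmatizer()

    def edit_similarity(a, b):
        """Edit-distance-style similarity in [0, 1] (used for sim_raw and sim_lemma)."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def semantic_similarity(a, b):
        # Placeholder for sim_sem: plug in e.g. cosine similarity of word vectors.
        return 0.0

    def vocab_score(reference, target, voc_alpha=0.4, voc_beta=0.3, voc_gamma=0.3):
        """voc_score = voc_alpha*sim_raw + voc_beta*sim_lemma + voc_gamma*sim_sem (illustrative weights)."""
        sim_raw = edit_similarity(reference, target)
        sim_lemma = edit_similarity(lemmatizer.lemmatize(reference.lower()),
                                    lemmatizer.lemmatize(target.lower()))
        sim_sem = semantic_similarity(reference, target)
        return voc_alpha * sim_raw + voc_beta * sim_lemma + voc_gamma * sim_sem

    VOC_SCORE_THRESHOLD = 0.6   # illustrative; the embodiment uses per-type thresholds

    basic_label_set = set()
    if vocab_score("apples", "apple") > VOC_SCORE_THRESHOLD:
        basic_label_set.add("vocab:apple")   # basic knowledge point label of the target vocabulary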
S308, adding the basic knowledge point label related to the target content vocabulary and the basic knowledge point label related to the target high-frequency vocabulary into the basic knowledge point label set.
The basic knowledge point label set comprises basic knowledge point labels corresponding to original data resources, the basic knowledge point labels included in the basic knowledge point label set are basic knowledge point labels associated with target knowledge points with similarity values larger than a similarity threshold, and the similarity thresholds corresponding to the basic knowledge point labels are different according to different basic knowledge points. The number of the basic knowledge point tags in the basic knowledge point tag set is related to the content of the original data resource, and the basic knowledge point tags in the basic knowledge point tag set are associated with the original data resource, that is, the basic knowledge point tags in the basic knowledge point tag set are equivalent to the basic knowledge point tags which are marked on the original data resource and correspond to the original data resource.
S309, performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set.
The part-of-speech tagging set refers to the set of parts of speech corresponding to the words in the word set; each word in the word set and each part of speech in the part-of-speech tagging set have a one-to-one mapping relationship. The parts of speech in the part-of-speech tagging set may include nouns, verbs, adjectives, numerals, measure words, pronouns, adverbs, prepositions, conjunctions, particles, interjections and the like; the specific parts of speech depend on the learning content in the original data resource.
Generally, part-of-speech tagging needs to determine the most appropriate part of speech for each word according to the context of the sentence. Some words can take several parts of speech (for example, a word that can be used as either a noun or a verb), sometimes called multi-category words, and such words occur very frequently among common words. This can be solved with probabilistic methods; for example, an HMM (Hidden Markov Model) can be used to handle the tagging of such words. In addition, part-of-speech tagging can also be performed based on a transformation-based or a classification-based approach.
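A short sketch of the part-of-speech tagging in step S309 (and the screening in step S310), assuming NLTK's default perceptron-based tagger; an HMM tagger, as mentioned above, could be trained and substituted, but that is not shown here.

    import nltk
    # Assumed one-time download: nltk.download("averaged_perceptron_tagger")

    def pos_tag_words(word_set):
        """Return the part-of-speech tagging set, aligned one-to-one with the word set."""
        return nltk.pos_tag(word_set)

    def filter_by_pos(tagged_words, tag_prefixes):
        """Keep the words whose Penn Treebank tag starts with any of the given prefixes."""
        return [word for word, tag in tagged_words if tag.startswith(tag_prefixes)]

    tagged = pos_tag_words(["I", "eat", "three", "apples"])
    # e.g. [('I', 'PRP'), ('eat', 'VBP'), ('three', 'CD'), ('apples', 'NNS')]
    reference_verbs = filter_by_pos(tagged, ("VB",))    # verb words -> reference verb vocabulary
    reference_numerals = filter_by_pos(tagged, ("CD",)) # numeral words -> reference mathematical vocabulary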
S310, taking the words whose part of speech is a verb as reference verb vocabulary, and taking the words whose part of speech is a numeral as reference mathematical vocabulary.
Generally, after the part-of-speech analysis of the words in the word set, the part of speech corresponding to each word contained in the original data resource is obtained. The words whose part of speech is a verb or a numeral can therefore be screened out according to the part of speech: the words with a verb part of speech are taken as reference verb vocabulary knowledge points, and the words with a numeral part of speech are taken as reference mathematical vocabulary knowledge points.
S311, calculating the similarity value between the reference verb vocabulary and the target verb vocabulary and the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary.
The reference verb vocabulary and the reference mathematical vocabulary are basic knowledge points contained in the original data resource, while the target verb vocabulary and the target mathematical vocabulary are the target knowledge points in the knowledge graph corresponding to the course information of the original data resource. The similarity value refers to the similarity relationship between the two quantities being compared; generally, a greater similarity value indicates greater similarity. Here the similarity values are the similarity value between the reference verb vocabulary and the target verb vocabulary and the similarity value between the reference mathematical vocabulary and the target mathematical vocabulary.
S312, when the similarity value is larger than the similarity threshold, acquiring the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target mathematical vocabulary.
The similarity threshold is the lowest value that the similarity value must reach to satisfy the condition. Depending on the basic knowledge point, the corresponding similarity thresholds may differ; the similarity threshold corresponding to the reference verb vocabulary is different from the similarity threshold corresponding to the reference mathematical vocabulary.
S313, adding the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target mathematical vocabulary into the basic knowledge point label set.
The basic knowledge point labels are associated with the target knowledge points in the knowledge graph; if the basic knowledge points of the original data resource include a basic knowledge point whose similarity value with a target knowledge point in the knowledge graph is greater than the similarity threshold, that basic knowledge point can be associated with the basic knowledge point label of the corresponding target knowledge point. Labels are data used to describe the characteristics of data resources, and different data resources correspond to different label data. Labels can effectively represent the knowledge points, learning content or learning abilities related to a data resource, and labeling data resources with different labels makes it possible to screen and analyze them.
By way of example: the sentences of the sentence set { S1, S2 … Sn } obtained by segmentation are split into words to obtain the word sets { { T11, T12 … T1o1}, { T21, T22 … T2o2} … } corresponding to each sentence, part-of-speech tagging is performed on each word in the word set to obtain the part-of-speech tagging set, the words whose part of speech is a verb are taken as reference verb vocabulary, and the words whose part of speech is a numeral are taken as reference mathematical vocabulary. Similarity calculation is then performed in turn between the reference verb vocabulary and the target verb vocabulary, and between the reference numeral vocabulary and the target numeral vocabulary. Denote a reference verb vocabulary item as Vi and a target verb vocabulary item as Uj, and a reference numeral vocabulary item as Va and a target numeral vocabulary item as Ub. For Vi and Uj, the edit distance similarity sim_raw3, the edit distance similarity sim_lemma3 after lemmatization, and the semantic similarity sim_sem3 are calculated respectively, and a total similarity value voc_score3 = voc_α·sim_raw3 + voc_β·sim_lemma3 + voc_γ·sim_sem3 is calculated from these similarities; if the total similarity value is higher than a third preset similarity threshold voc_score_threshold3, the reference verb vocabulary Vi is labeled with the knowledge point label associated with the target verb vocabulary Uj. For Va and Ub, the edit distance similarity sim_raw4, the edit distance similarity sim_lemma4 after lemmatization, and the semantic similarity sim_sem4 are calculated respectively, and a total similarity value voc_score4 = voc_α·sim_raw4 + voc_β·sim_lemma4 + voc_γ·sim_sem4 is calculated from these similarities; if the total similarity value is higher than a fourth preset similarity threshold voc_score_threshold4, the reference numeral vocabulary Va is labeled with the knowledge point label associated with the target numeral vocabulary Ub.
And S314, analyzing each word in the word set and annotating each word with its phonetic symbol to obtain a phonetic symbol set.
The phonetic symbol set is a set containing phonetic symbols corresponding to all words in the word set, and each word in the word set and each phonetic symbol in the phonetic symbol set have a one-to-one mapping relation.
S315, calculating similarity values of the word phonetic symbols in the phonetic symbol set and the target phonetic symbols.
The word phonetic symbols in the phonetic symbol set are basic knowledge points contained in the original data resources, and the target phonetic symbols are target knowledge points corresponding to course information of the original data resources in the knowledge graph. Similarity values refer to the similarity relationship between two quantities being compared, typically a greater similarity value indicates a greater similarity between the two quantities, where the similarity value may be the similarity value of the word phonetic symbol to the target phonetic symbol.
And S316, when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target phonetic symbol.
The similarity threshold is the lowest value at which a similarity value is considered to meet the condition. Different basic knowledge points may correspond to different similarity thresholds; the similarity threshold corresponding to the word phonetic symbols may be set arbitrarily as required and may also differ from the similarity thresholds described above.
And S317, adding the basic knowledge point label associated with the target phonetic symbol into the basic knowledge point label set.
The basic knowledge point labels are associated with the target knowledge points in the knowledge graph, and if the basic knowledge points of the original data resources include basic knowledge points with similarity values larger than the similarity threshold value with the target knowledge points in the knowledge graph, the basic knowledge points can be associated with the basic knowledge point labels of the corresponding target knowledge points. The labels are data used for describing the characteristics of the data resources, the label data corresponding to different data resources are different, the knowledge points, the learning content or the learning capacity related to the data resources can be effectively represented through the labels, and the data resources can be screened and analyzed by labeling the data resources with different labels.
For example, the following steps are carried out: word segmentation processing is performed on the sentence set { S1, S2 … Sn } to obtain the word sets { { T11, T12 … T1o1 }, { T21, T22 … T2o2 } … } corresponding to each sentence, and the words in the word sets are converted into the corresponding phonetic symbol sets { { P11, P12 … P1o1 }, { P21, P22 … P2o2 } … } through a dictionary tool. Similarity calculation is then performed in sequence between each phonetic symbol Pi in the phonetic symbol sets and each target phonetic symbol Kj, including: the containment similarity sim_in indicating whether the target pronunciation combination occurs in the source word Ti corresponding to the phonetic symbol Pi, the edit distance similarity sim_edit of Pi and Kj, and the longest common subsequence similarity sim_lcs of Pi and Kj. A total similarity value phon_score = phon_α·sim_in + phon_β·sim_edit + phon_γ·sim_lcs is calculated according to these similarities, and if the total similarity value is higher than a preset similarity threshold phon_score_threshold, the word phonetic symbol Pi is labeled with the knowledge point label associated with the target phonetic symbol Kj.
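A minimal Python sketch of this phonetic-symbol scoring is given below. The weights, the injected edit_similarity() helper (e.g. the one from the previous sketch) and the interpretation of the containment check are assumptions for illustration only.

```python
# Illustrative sketch only: phonetic-symbol score in the style of
# phon_score = phon_alpha*sim_in + phon_beta*sim_edit + phon_gamma*sim_lcs.

def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def phon_score(pi: str, kj: str, edit_similarity,
               phon_alpha: float = 0.3, phon_beta: float = 0.4, phon_gamma: float = 0.3) -> float:
    sim_in = 1.0 if kj in pi else 0.0                        # target sound combination contained in Pi
    sim_edit = edit_similarity(pi, kj)                       # normalised edit distance of Pi and Kj
    sim_lcs = lcs_length(pi, kj) / max(len(pi), len(kj), 1)  # longest common subsequence similarity
    return phon_alpha * sim_in + phon_beta * sim_edit + phon_gamma * sim_lcs

# Usage pattern: if phon_score(Pi, Kj, edit_similarity) > phon_score_threshold,
# attach the knowledge point label associated with the target phonetic symbol Kj.
```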
S318, carrying out dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree.
The dependency syntax tree is a relationship tree that describes the dependency relationships between words in the data resource; it represents the syntactic collocations between words, and these collocations are related to semantics. The basic task of dependency parsing is to determine the syntactic structure of a sentence, that is, the dependencies between the words in the sentence, which mainly involves two aspects: determining the grammar system of the language, namely giving a formal definition of the grammar structure of legal sentences in that language; and the syntactic analysis technique itself, which automatically derives the syntactic structure of a sentence according to the given grammar system and analyzes the syntactic units contained in the sentence and the relationships between them. A sketch of obtaining such a tree with an off-the-shelf parser is given below.
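The use of spaCy and its small English model in the following sketch is an assumption; the embodiment does not prescribe a particular parser.

```python
# Illustrative sketch only: obtaining a dependency syntax tree per sentence with
# an off-the-shelf parser (spaCy here, as an assumption).
import spacy

nlp = spacy.load("en_core_web_sm")

def dependency_tree(sentence: str):
    """Return (token, dependency relation, head token) triples for one sentence."""
    doc = nlp(sentence)
    return [(token.text, token.dep_, token.head.text) for token in doc]

# dependency_tree("She reads a book every day.")
# -> e.g. [('She', 'nsubj', 'reads'), ('reads', 'ROOT', 'reads'), ('a', 'det', 'book'), ...]
```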
And S319, calculating similarity values of the words in the word set, the parts of speech in the part of speech tagging set and the dependency syntax tree corresponding to the words, the parts of speech and the syntax tree in the target sentence pattern respectively.
The similarity value refers to the similarity relationship between the two compared quantities; generally, the greater the similarity value, the more similar the two quantities are. Here, the similarity value may be the similarity of the words in the word set of the original data resource, the parts of speech in the part-of-speech tagging set and the dependency syntax tree with respect to the words, the parts of speech and the syntax tree of the target sentence pattern.
S320, when the similarity value is larger than the similarity threshold value, acquiring the basic knowledge point label associated with the target sentence pattern.
The similarity threshold is the lowest value at which a similarity value is considered to meet the condition, and it may differ according to the basic knowledge point; the similarity thresholds corresponding to the words in the word set, the parts of speech in the part-of-speech tagging set and the dependency syntax tree may be set arbitrarily as required and may also differ from the similarity thresholds described above.
S321, adding the basic knowledge point labels associated with the target sentence patterns into the basic knowledge point label set.
The basic knowledge point labels are associated with the target knowledge points in the knowledge graph, and if the basic knowledge points of the original data resources include basic knowledge points with similarity values larger than the similarity threshold value with the target knowledge points in the knowledge graph, the basic knowledge points can be associated with the basic knowledge point labels of the corresponding target knowledge points. The labels are data used for describing the characteristics of the data resources, the label data corresponding to different data resources are different, the knowledge points, the learning content or the learning capacity related to the data resources can be effectively represented through the labels, and the data resources can be screened and analyzed by labeling the data resources with different labels.
For example, the following steps are carried out: word segmentation processing is performed on the sentence set { S1, S2 … Sn } obtained by the sentence segmentation to obtain the word sets { { T11, T12 … T1o1 }, { T21, T22 … T2o2 } … } corresponding to each sentence, and part-of-speech tagging and dependency syntax analysis are performed on the sentence set { S1, S2 … Sn } to obtain the part-of-speech tagging sets { { Pos11, Pos12 … Pos1o1 }, { Pos21, Pos22 … Pos2o2 } … } and the dependency syntax trees { Tree1, Tree2 … Treen }. For each sentence Si, the similarity with the example sentence of each target sentence-pattern knowledge point Kj is calculated in sequence, including: the Jaccard similarity sim_token_jaccard between the word set Ti = { Ti1, Ti2 … Tim } of the sentence Si and the word set KTj = { KTj1, KTj2 … KTjk } of the example sentence, the edit distance similarity sim_pos_edit and the longest common subsequence similarity sim_pos_lcs between the part-of-speech sequence Posi of the sentence Si and the part-of-speech sequence KPosj of the example sentence, and the tree similarity sim_tree between the dependency syntax tree Treei of the sentence Si and the dependency syntax tree KTreej of the example sentence. A total similarity value sent_score = sent_α·sim_token_jaccard + sent_β·sim_pos_edit + sent_γ·sim_pos_lcs + sent_θ·sim_tree is calculated according to these similarity values, and if the total similarity value is higher than a preset similarity threshold sent_score_threshold, the sentence Si is labeled with the basic knowledge point label associated with the sentence pattern of the target knowledge point Kj.
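A hedged Python sketch of this sentence-pattern score follows. The injected sequence and tree similarity helpers, as well as the equal default weights, are assumptions for illustration.

```python
# Illustrative sketch only: sentence-pattern score in the style of
# sent_score = sent_alpha*sim_token_jaccard + sent_beta*sim_pos_edit
#              + sent_gamma*sim_pos_lcs + sent_theta*sim_tree.

def jaccard(tokens_a, tokens_b) -> float:
    """Jaccard similarity of two token sets."""
    sa, sb = set(tokens_a), set(tokens_b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def sentence_score(tokens_i, pos_i, tree_i, tokens_j, pos_j, tree_j,
                   seq_edit_similarity, seq_lcs_similarity, tree_similarity,
                   sent_alpha=0.25, sent_beta=0.25, sent_gamma=0.25, sent_theta=0.25) -> float:
    sim_token_jaccard = jaccard(tokens_i, tokens_j)   # word overlap with the example sentence
    sim_pos_edit = seq_edit_similarity(pos_i, pos_j)  # POS-tag sequence edit similarity
    sim_pos_lcs = seq_lcs_similarity(pos_i, pos_j)    # POS-tag longest common subsequence similarity
    sim_tree = tree_similarity(tree_i, tree_j)        # e.g. a normalised tree edit distance
    return (sent_alpha * sim_token_jaccard + sent_beta * sim_pos_edit
            + sent_gamma * sim_pos_lcs + sent_theta * sim_tree)

# Usage pattern: if sentence_score(...) > sent_score_threshold, attach the basic
# knowledge point label associated with the target sentence pattern Kj to Si.
```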
S322, calculating the similarity value between the grammar contained in the sentence set and the target grammar based on the word set and the part of speech tagging set.
The grammar contained in the sentence set is a basic knowledge point of the original data resource, and the target grammar is a target knowledge point corresponding to the course information of the original data resource in the knowledge graph. The similarity value refers to the similarity relationship between two compared quantities; generally, the greater the similarity value, the more similar the two quantities are. Here, the similarity value may be the similarity between a grammar contained in the sentence set of the original data resource and the corresponding target grammar.
And S323, when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target grammar.
The similarity threshold is the lowest value at which a similarity value is considered to meet the condition. Different basic knowledge points may correspond to different similarity thresholds; the similarity threshold corresponding to the grammars contained in the sentence set may be set arbitrarily as required and may also differ from the similarity thresholds described above.
S324, adding the basic knowledge point label associated with the target grammar into the basic knowledge point label set.
The basic knowledge point labels are associated with the target knowledge points in the knowledge graph, and if the basic knowledge points of the original data resources include basic knowledge points with similarity values larger than the similarity threshold value with the target knowledge points in the knowledge graph, the basic knowledge points can be associated with the basic knowledge point labels of the corresponding target knowledge points. The labels are data used for describing the characteristics of the data resources, the label data corresponding to different data resources are different, the knowledge points, the learning content or the learning capacity related to the data resources can be effectively represented through the labels, and the data resources can be screened and analyzed by labeling the data resources with different labels.
For example, the following steps are carried out: word segmentation processing is performed on the sentence set { S1, S2 … Sn } obtained by the sentence segmentation to obtain the word sets { { T11, T12 … T1o1 }, { T21, T22 … T2o2 } … } corresponding to each sentence, and part-of-speech tagging is performed on the sentence set { S1, S2 … Sn } to obtain the part-of-speech tagging sets { { Pos11, Pos12 … Pos1o1 }, { Pos21, Pos22 … Pos2o2 } … }. For each sentence Si, it is checked in sequence whether the grammar fragment KTj of each target grammar knowledge point Kj is contained in the word set Ti = { Ti1, Ti2 … Tim } of the sentence, giving a containment similarity sim_token_in, and whether the corresponding part-of-speech fragment KPosj is contained in the part-of-speech sequence Posi of the sentence, giving a containment similarity sim_pos_in. A total similarity value gram_score is calculated by weighting these similarities (for example, gram_score = gram_α·sim_token_in + gram_β·sim_pos_in), and if the total similarity value is higher than the preset similarity threshold gram_score_threshold, the sentence Si is labeled with the basic knowledge point label associated with the target grammar Kj.
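The following sketch only illustrates the containment-style grammar matching described above; the weighting of the two containment signals is an assumption for illustration.

```python
# Illustrative sketch only: grammar matching by checking whether the target
# grammar fragment occurs in the sentence's words and in its POS-tag sequence.

def contains_fragment(sequence, fragment) -> float:
    """1.0 if `fragment` occurs as a contiguous run inside `sequence`, else 0.0."""
    n, m = len(sequence), len(fragment)
    if m == 0:
        return 1.0
    for start in range(n - m + 1):
        if sequence[start:start + m] == fragment:
            return 1.0
    return 0.0

def grammar_score(tokens_i, pos_i, target_tokens, target_pos,
                  gram_alpha: float = 0.5, gram_beta: float = 0.5) -> float:
    sim_token_in = contains_fragment(tokens_i, target_tokens)  # lexical pattern containment
    sim_pos_in = contains_fragment(pos_i, target_pos)          # POS pattern containment
    return gram_alpha * sim_token_in + gram_beta * sim_pos_in

# Usage pattern: if grammar_score(...) > gram_score_threshold, attach the basic
# knowledge point label associated with the target grammar Kj to Si.
```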
For example, the following steps are carried out: the basic knowledge point labels corresponding to the obtained original data resource and their numbers are counted, which can be recorded as: Tag(text) = {voc: num1, verb: num2, math: num3, hfw: num4, phon: num5, sent: num6, gram: num7 | num ≥ 0}. The whole process of calculating the similarity between the basic knowledge points of the original data resource and the target knowledge points corresponding to the course information of the original data resource in the knowledge graph can be seen in fig. 4.
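For illustration, the per-category counts in the Tag(text) record above can be aggregated as follows; the category keys mirror the example, and the function name is an assumption.

```python
# Illustrative sketch only: aggregating attached labels into per-category counts.
from collections import Counter

CATEGORIES = ("voc", "verb", "math", "hfw", "phon", "sent", "gram")

def tag_summary(labels):
    """labels: iterable of (category, knowledge_point_label) pairs for one resource."""
    counts = Counter(category for category, _ in labels)
    return {cat: counts.get(cat, 0) for cat in CATEGORIES}

# tag_summary([("voc", "apple"), ("phon", "/ae/"), ("voc", "banana")])
# -> {'voc': 2, 'verb': 0, 'math': 0, 'hfw': 0, 'phon': 1, 'sent': 0, 'gram': 0}
```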
S325, generating a comprehensive knowledge point label set of the original data resource according to the characteristic information of the original data resource and the basic knowledge point label set.
The characteristic information refers to the type of the original data resource; for example, the type may include audio, video, text, picture book, etc., and different types may train/exercise different learning abilities (e.g., listening, speaking, reading or writing). The comprehensive knowledge point labels in the comprehensive knowledge point label set are knowledge point labels generated based on the characteristic information of the original data resource and the basic knowledge point label set, and they reflect the abilities of users (students) that the original data resource exercises. An original data resource may correspond to a plurality of comprehensive knowledge point labels. For example: if the original data resource comprises audio and the basic knowledge point label set comprises knowledge point labels associated with target phonetic symbols, the comprehensive knowledge point label corresponding to the original data resource can be analyzed and obtained as a listening label.
Generally, the comprehensive knowledge point tags of the original data resources are related to the characteristics of the original data resources, the comprehensive knowledge point tags in the comprehensive knowledge point tag set and the basic knowledge point tags in the basic knowledge point tag set are both associated with the original data resources, and after the basic knowledge point tag set and the comprehensive knowledge point tag set are generated, it is indicated that the original data resources are labeled with related knowledge point tags (including the basic knowledge point tags and the comprehensive knowledge point tags).
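A minimal sketch of deriving comprehensive (ability) labels from the resource type and the basic knowledge point label counts is given below, following the audio plus phonetic-symbol example above; all mapping rules other than that example are assumptions for illustration.

```python
# Illustrative sketch only: resource type + basic label counts -> ability labels.

def comprehensive_tags(resource_type: str, basic_tag_counts: dict) -> set:
    tags = set()
    if resource_type == "audio" and basic_tag_counts.get("phon", 0) > 0:
        tags.add("listening")   # example given in the embodiment
    if resource_type == "text" and basic_tag_counts.get("sent", 0) > 0:
        tags.add("reading")     # assumed rule, for illustration only
    if resource_type == "video" and basic_tag_counts.get("voc", 0) > 0:
        tags.add("speaking")    # assumed rule, for illustration only
    return tags

# comprehensive_tags("audio", {"phon": 3, "voc": 5}) -> {'listening'}
```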
When the scheme of the embodiment of the application is executed, the server preprocesses the original data resource to obtain text data, and queries the attribute information corresponding to the original data resource from a preset knowledge graph to obtain a plurality of target knowledge points corresponding to the attribute information. The text data is segmented into a sentence set, and the sentence set is blocked to obtain a word block set; the importance degree weight of each word block in the word block set is extracted based on the keyword extraction TF-IDF algorithm, word blocks whose importance degree weight is larger than a first preset weight are taken as reference content vocabularies, and word blocks whose importance degree weight is smaller than or equal to a second preset weight are taken as reference high-frequency vocabularies. The similarity values between the reference content vocabularies and the target content vocabularies and between the reference high-frequency vocabularies and the target high-frequency vocabularies are calculated, and when a similarity value is larger than the similarity threshold, the basic knowledge point label associated with the target content vocabulary and the basic knowledge point label associated with the target high-frequency vocabulary are acquired and added into the basic knowledge point label set. Word segmentation is further performed on the sentence set to obtain a word set, part-of-speech tagging is performed on each word in the word set to obtain a part-of-speech tagging set, words whose part of speech is a verb are taken as reference verb vocabularies, and words whose part of speech is a numeral are taken as reference mathematical vocabularies; the similarity values between the reference verb vocabularies and the target verb vocabularies and between the reference mathematical vocabularies and the target mathematical vocabularies are calculated, and when a similarity value is larger than the similarity threshold, the basic knowledge point label associated with the target verb vocabulary and the basic knowledge point label associated with the target mathematical vocabulary are acquired and added into the basic knowledge point label set. Each word in the word set is analyzed and annotated with its phonetic symbol to obtain a phonetic symbol set; the similarity values between the word phonetic symbols in the phonetic symbol set and the target phonetic symbols are calculated, and when a similarity value is larger than the similarity threshold, the basic knowledge point label associated with the target phonetic symbol is acquired and added into the basic knowledge point label set. Dependency syntax analysis is performed on the sentences in the sentence set to obtain dependency syntax trees; the similarity values of the words in the word set, the parts of speech in the part-of-speech tagging set and the dependency syntax trees with respect to the words, the parts of speech and the syntax tree of the target sentence pattern are calculated, and when a similarity value is larger than the similarity threshold, the basic knowledge point label associated with the target sentence pattern is acquired and added into the basic knowledge point label set. Based on the word set and the part-of-speech tagging set, the similarity values between the grammars contained in the sentence set and the target grammars are calculated, and when a similarity value is larger than the similarity threshold, the basic knowledge point label associated with the target grammar is acquired and added into the basic knowledge point label set. Finally, a comprehensive knowledge point label set of the original data resource is generated according to the characteristic information of the original data resource and the basic knowledge point label set. In this way, the relevant knowledge point labels can be attached to the original data resource quickly and accurately, improving both labeling efficiency and labeling accuracy.
As described above, the embodiments are mainly described in the online education industry, but those skilled in the art will understand that the method is not limited to the online education industry, and the method described in the present application can be applied to user tag processing in various industries such as retail, transportation, social, search, education, medical, etc.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 5, which illustrates a schematic structural diagram of a data resource annotation device according to an exemplary embodiment of the present application. Hereinafter referred to as the apparatus 5, the apparatus 5 may be implemented as all or a part of the terminal by software, hardware or a combination of both. The device 5 comprises a preprocessing module 501, a calculation module 502, a first processing module 503 and a second processing module 504.
A preprocessing module 501, configured to preprocess an original data resource to obtain text data;
a calculating module 502, configured to perform similarity calculation on the text data and the multiple target knowledge points respectively to obtain similarity values; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;
a first processing module 503, configured to generate a basic knowledge point tag set of the original data resource according to a comparison result between the similarity value and the similarity threshold; wherein, the basic knowledge point labels included in the basic knowledge point label set are: basic knowledge point labels associated with the target knowledge points with similarity values larger than a similarity threshold value;
a second processing module 504, configured to generate a comprehensive knowledge point tag set of the original data resource according to the feature information of the original data resource and the basic knowledge point tag set.
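Purely for illustration, the cooperation of the four modules can be sketched as follows; the class and callable names are assumptions and do not define the apparatus.

```python
# Illustrative sketch only: how the four modules could cooperate.

class DataResourceAnnotator:
    def __init__(self, preprocess, compute_similarities, build_basic_tags, build_comprehensive_tags):
        self.preprocess = preprocess                              # preprocessing module 501
        self.compute_similarities = compute_similarities          # calculation module 502
        self.build_basic_tags = build_basic_tags                  # first processing module 503
        self.build_comprehensive_tags = build_comprehensive_tags  # second processing module 504

    def annotate(self, raw_resource, target_knowledge_points, thresholds, feature_info):
        text = self.preprocess(raw_resource)
        scores = self.compute_similarities(text, target_knowledge_points)
        basic_tags = self.build_basic_tags(scores, thresholds)
        comprehensive = self.build_comprehensive_tags(feature_info, basic_tags)
        return {"basic": basic_tags, "comprehensive": comprehensive}
```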
Optionally, the apparatus 5 further comprises:
a query unit, configured to query attribute information corresponding to the original data resource from a preset knowledge graph to obtain the multiple target knowledge points corresponding to the attribute information.
Optionally, the original data resource in the device 5 is a teaching resource, and the attribute information is course information.
Optionally, the first processing module 503 includes:
the first processing unit is used for carrying out sentence segmentation processing on the text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set;
the first calculation unit is used for analyzing the word block set and the word set to obtain a reference word set and calculating the similarity value of each reference word in the reference word set and the corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematic vocabulary, wherein a target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, a target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, a target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and a target knowledge point corresponding to the reference mathematic vocabulary is a target mathematic vocabulary;
a first adding unit, configured to add, when the similarity value of each corresponding target knowledge point is greater than the similarity threshold value of each corresponding target knowledge point, a basic knowledge point tag corresponding to the corresponding target knowledge point to the basic knowledge point tag set; or
The second processing unit is used for carrying out sentence segmentation processing on the text data to obtain a sentence set and carrying out word segmentation processing on the sentence set respectively to obtain a word set;
the second calculation unit is used for analyzing the word set to respectively obtain a phonetic symbol set, a part of speech tagging set and a dependency syntax tree, and calculating similarity values of word phonetic symbols in the phonetic symbol set, sentence patterns in the sentence set and grammars in the sentence set and corresponding target knowledge points; the target knowledge points corresponding to the word phonetic symbols are target phonetic symbols, the target knowledge points corresponding to the sentence patterns are target sentence patterns, and the target knowledge points corresponding to the grammars are target grammars;
and a second adding unit, configured to add, when the similarity value of each corresponding target knowledge point is greater than the similarity threshold, a basic knowledge point tag corresponding to the corresponding target knowledge point to the basic knowledge point tag set.
Optionally, the first processing module 503 includes:
the third processing unit is used for carrying out sentence segmentation processing on the learning text data to obtain a sentence set and carrying out blocking processing on the sentence set to obtain a word block set;
the third calculation unit is used for calculating the importance degree weight of each word block in the word block set based on the key word extraction TF-IDF algorithm;
the first selection unit is used for taking the word block with the importance degree weight larger than a first preset weight as the reference content vocabulary;
a fourth calculating unit configured to calculate a similarity value between the reference content vocabulary and the target content vocabulary;
the first acquisition unit is used for acquiring a basic knowledge point label associated with the target content vocabulary when the similarity value is greater than a similarity threshold value;
and the first adding unit is used for adding the basic knowledge point label associated with the target content vocabulary into the basic knowledge point label set.
Optionally, the first processing module 503 includes:
the fourth processing unit is used for carrying out sentence segmentation processing on the text data to obtain a sentence set and carrying out blocking processing on the sentence set to obtain a word block set;
the fourth calculation unit is used for calculating the importance degree weight of each word block in the word block set based on the key word extraction TF-IDF algorithm;
the second selection unit is used for taking the word block of which the importance degree weight is less than or equal to a second preset weight as the reference high-frequency vocabulary;
a fifth calculating unit, configured to calculate similarity values between the reference high-frequency vocabulary and the target high-frequency vocabulary;
the second acquisition unit is used for acquiring a basic knowledge point label associated with the target high-frequency vocabulary when the similarity value is greater than the similarity threshold value;
and the second adding unit is used for adding the basic knowledge point label associated with the target high-frequency vocabulary into the basic knowledge point label set.
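As an illustrative sketch of the TF-IDF based selection performed by the units above, the following assumes scikit-learn's TfidfVectorizer is available, uses each term's maximum weight across sentences as its importance degree weight, and lets single terms stand in for the word blocks of the embodiment (all of which are assumptions).

```python
# Illustrative sketch only: splitting terms by TF-IDF importance weight.
from sklearn.feature_extraction.text import TfidfVectorizer

def split_by_importance(sentences, first_preset: float = 0.3, second_preset: float = 0.1):
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(sentences)       # one TF-IDF row per sentence
    weights = matrix.max(axis=0).toarray().ravel()     # each term's highest weight across sentences
    terms = vectorizer.get_feature_names_out()
    content_vocab = [t for t, w in zip(terms, weights) if w > first_preset]      # reference content vocabulary
    high_freq_vocab = [t for t, w in zip(terms, weights) if w <= second_preset]  # reference high-frequency vocabulary
    return content_vocab, high_freq_vocab
```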
Optionally, the first processing module 503 includes:
the fifth processing unit is used for carrying out sentence segmentation processing on the text data to obtain a sentence set and carrying out word segmentation processing on the sentence set to obtain a word set;
the first labeling unit is used for performing part-of-speech labeling on each word in the word set to obtain a part-of-speech labeling set;
a third selecting unit, configured to use the word whose part of speech is the verb part of speech as a reference verb word;
a sixth calculating unit configured to calculate a similarity value between the reference verb vocabulary and the target verb vocabulary;
the third obtaining unit is used for obtaining a basic knowledge point label associated with the target verb vocabulary when the similarity value is larger than the similarity threshold value;
and the third adding unit is used for adding the basic knowledge point label associated with the target verb vocabulary into the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a sixth processing unit, configured to perform sentence segmentation processing on the text data to obtain a sentence set, and perform word segmentation processing on the sentence set to obtain a word set;
the second labeling unit is used for performing part-of-speech labeling on each word in the word set to obtain a part-of-speech labeling set;
the fourth selection unit is used for taking the words with the part of speech being the part of speech of the digital words as reference mathematical words;
a seventh calculating unit, configured to calculate a similarity value between the reference mathematical vocabulary and the target mathematical vocabulary;
the fourth acquisition unit is used for acquiring a basic knowledge point label associated with the target mathematical vocabulary when the similarity value is greater than the similarity threshold value;
and the fourth adding unit is used for adding the basic knowledge point label associated with the target mathematical vocabulary into the basic knowledge point label set.
Optionally, the first processing module 503 includes:
a seventh processing unit, configured to perform sentence segmentation processing on the text data to obtain a sentence set, and perform word segmentation processing on the sentence set to obtain a word set;
the third labeling unit is used for analyzing each word in the word set and annotating each word with its phonetic symbol to obtain a phonetic symbol set;
an eighth calculating unit, configured to calculate similarity values between word phonetic symbols in the phonetic symbol set and the target phonetic symbol;
a fifth selecting unit, configured to obtain a basic knowledge point tag associated with the target phonetic symbol when the similarity value is greater than the similarity threshold;
and a fifth adding unit, configured to add the basic knowledge point tag associated with the target phonetic symbol to the basic knowledge point tag set.
Optionally, the first processing module 503 includes:
the eighth processing unit is used for carrying out sentence segmentation processing on the text data to obtain a sentence set and carrying out word segmentation processing on the sentence set to obtain a word set;
the fourth labeling unit is used for performing part-of-speech labeling on each word in the word set based on the word set to obtain a part-of-speech labeling set;
the analysis unit is used for carrying out dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree;
a ninth calculating unit, configured to calculate similarity values corresponding to the words in the word set, the parts of speech in the part of speech tagging set, and the dependency syntax tree in the target sentence pattern, respectively;
a fifth obtaining unit, configured to obtain a basic knowledge point tag associated with the target sentence pattern when the similarity value is greater than the similarity threshold;
and a sixth adding unit, configured to add the basic knowledge point tag associated with the target sentence pattern to the basic knowledge point tag set.
Optionally, the first processing module 503 includes:
a ninth processing unit, configured to perform sentence segmentation processing on the text data to obtain a sentence set, and perform word segmentation processing on the sentence set to obtain a word set;
a fifth tagging unit, configured to perform part-of-speech tagging on each word in the word set based on the word set to obtain a part-of-speech tagging set;
a tenth calculating unit, configured to calculate a similarity value between a grammar included in the sentence set and a target grammar, based on the word set and the part-of-speech tagging set;
a sixth obtaining unit, configured to obtain a basic knowledge point tag associated with the target grammar when the similarity value is greater than a similarity threshold;
and a seventh adding unit, configured to add the basic knowledge point tag associated with the target grammar to the basic knowledge point tag set.
It should be noted that, when the apparatus 5 provided in the foregoing embodiment executes the method for tagging a data resource, only the division of each functional module is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the embodiments of the data resource labeling method provided in the foregoing embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.
As described above, the embodiments are mainly described in the online education industry, but those skilled in the art will understand that the method is not limited to the online education industry, and the method described in the present application can be applied to user tag processing in various industries such as retail, transportation, social, search, education, medical, etc.
Fig. 6 is a schematic structural diagram of a data resource tagging apparatus provided in an embodiment of the present application, hereinafter referred to as the apparatus 6 for short. The apparatus 6 may be integrated in the foregoing server or terminal device. As shown in fig. 6, the apparatus includes: a memory 602, a processor 601, an input device 603, an output device 604 and a communication interface.
The memory 602 may be a separate physical unit, and may be connected to the processor 601, the input device 603, and the output device 604 via a bus. The memory 602, processor 601, input device 603, and output device 604 may also be integrated, implemented in hardware, etc.
The memory 602 is used for storing a program for implementing the above method embodiment, or various modules of the apparatus embodiment, and the processor 601 calls the program to execute the operations of the above method embodiment.
The input device 603 includes, but is not limited to, a keyboard, a mouse, a touch panel, a camera and a microphone; the output device 604 includes, but is not limited to, a display screen.
Communication interfaces are used to send and receive various types of messages and include, but are not limited to, wireless interfaces or wired interfaces.
Alternatively, when part or all of the data resource labeling method of the above embodiments is implemented by software, the apparatus may also include only a processor. The memory for storing the program is located outside the apparatus, and the processor is connected to the memory through circuits/wires to read and execute the program stored in the memory.
The processor may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.
The processor may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory may include volatile memory (volatile memory), such as random-access memory (RAM); the memory may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); the memory may also comprise a combination of memories of the kind described above.
Wherein the processor 601 calls the program code in the memory 602 for performing the following steps:
preprocessing original data resources to obtain text data;
respectively carrying out similarity value calculation on the text data and a plurality of target knowledge points to obtain similarity values; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;
generating a basic knowledge point label set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; wherein, the basic knowledge point labels included in the basic knowledge point label set are: basic knowledge point labels associated with the target knowledge points with similarity values larger than a similarity threshold value;
and generating a comprehensive knowledge point label set of the original data resources according to the characteristic information of the original data resources and the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
and inquiring attribute information corresponding to the original data resource from a preset knowledge graph to obtain the target knowledge points corresponding to the attribute information.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set;
analyzing the word block set and the word set to obtain a reference word set, and calculating similarity values of all reference words in the reference word set and corresponding target knowledge points; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematic vocabulary, wherein a target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, a target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, a target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and a target knowledge point corresponding to the reference mathematic vocabulary is a target mathematic vocabulary;
when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold value, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point into the basic knowledge point label set; or
Carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set respectively to obtain a word set;
analyzing the word set to respectively obtain a phonetic symbol set, a part of speech tagging set and a dependency syntax tree, and calculating similarity values of words phonetic symbols in the phonetic symbol set, sentence patterns in the sentence set and grammars in the sentence set and corresponding target knowledge points; the target knowledge points corresponding to the word phonetic symbols are target phonetic symbols, the target knowledge points corresponding to the sentence patterns are target sentence patterns, and the target knowledge points corresponding to the grammars are target grammars;
and when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold value, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out blocking processing on the sentence set to obtain a word block set;
calculating the importance degree weight of each word block in the word block set based on a key word extraction TF-IDF algorithm;
taking the word block with the importance degree weight larger than a first preset weight as the reference content vocabulary;
calculating similarity values of the reference content vocabularies and the target content vocabularies;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target content vocabulary;
and adding the basic knowledge point label associated with the target content vocabulary into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out blocking processing on the sentence set to obtain a word block set;
calculating the importance degree weight of each word block in the word block set based on a key word extraction TF-IDF algorithm;
taking the word block with the importance degree weight value less than or equal to a second preset weight value as the reference high-frequency vocabulary;
calculating similarity values of the reference high-frequency vocabulary and the target high-frequency vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target high-frequency vocabulary;
and adding the basic knowledge point label associated with the target high-frequency vocabulary into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
taking the words with the part of speech being verb part of speech as reference verb words;
calculating a similarity value of the reference verb vocabulary and the target verb vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target verb vocabulary;
and adding the basic knowledge point label associated with the target verb vocabulary into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
taking the words with the part of speech being the part of speech of the number words as reference mathematical words;
calculating the similarity value of the reference mathematical vocabulary and the target mathematical vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target mathematical vocabulary;
and adding the basic knowledge point label associated with the target mathematical vocabulary into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
analyzing each word in the word set and annotating each word with its phonetic symbol to obtain a phonetic symbol set;
calculating similarity values of the word phonetic symbols in the phonetic symbol set and the target phonetic symbols;
when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target phonetic symbol;
and adding the basic knowledge point label associated with the target phonetic symbol into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set based on the word set to obtain a part-of-speech tagging set;
carrying out dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree;
calculating similarity values of the words in the word set, the parts of speech in the part of speech tagging set and the dependency syntax tree which respectively correspond to the words, the parts of speech and the syntax tree in the target sentence pattern;
when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target sentence pattern;
and adding the basic knowledge point label associated with the target sentence pattern into the basic knowledge point label set.
In one or more embodiments, processor 601 is further configured to:
carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
calculating similarity values of grammars contained in the sentence sets and target grammars based on the word sets and the part-of-speech tagging sets;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target grammar;
and adding the basic knowledge point label associated with the target grammar into the basic knowledge point label set.
It should be noted that, when the apparatus 6 provided in the foregoing embodiment executes the method for tagging a data resource, only the division of each functional module is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the embodiments of the data resource labeling method provided in the foregoing embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.
As described above, the embodiments are mainly described in the online education industry, but those skilled in the art will understand that the method is not limited to the online education industry, and the method described in the present application can be applied to user tag processing in various industries such as retail, transportation, social, search, education, medical, etc.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the method steps in the embodiments shown in fig. 2 to fig. 3, and a specific execution process may refer to specific descriptions of the embodiments shown in fig. 2 to fig. 3, which is not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Claims (14)

1. A method for annotating data resources, the method comprising:
preprocessing original data resources to obtain text data;
respectively carrying out similarity calculation on the text data and a plurality of target knowledge points to obtain similarity values; wherein each of the plurality of target knowledge points is associated with a basic knowledge point tag;
generating a basic knowledge point label set of the original data resource according to a comparison result of the similarity value and the similarity threshold value; wherein, the basic knowledge point labels included in the basic knowledge point label set are: basic knowledge point labels associated with the target knowledge points with similarity values larger than a similarity threshold value;
and generating a comprehensive knowledge point label set of the original data resources according to the characteristic information of the original data resources and the basic knowledge point label set.
2. The method of claim 1, wherein the determining of the plurality of target knowledge points comprises:
and inquiring attribute information corresponding to the original data resource from a preset knowledge graph to obtain the target knowledge points corresponding to the attribute information.
3. The method of claim 2, wherein the raw data resource is an instructional resource and the attribute information is lesson information.
4. The method of claim 1, wherein generating the set of basic knowledge point tags for the original data resource according to the comparison result of the similarity value and the similarity threshold comprises:
carrying out sentence segmentation processing on the text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set;
analyzing the word block set and the word set to obtain a reference word set, and calculating similarity values of all reference words in the reference word set and corresponding target knowledge points; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematic vocabulary, wherein a target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, a target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, a target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and a target knowledge point corresponding to the reference mathematic vocabulary is a target mathematic vocabulary;
when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold value, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point into the basic knowledge point label set; or
Carrying out sentence segmentation processing on the text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set respectively to obtain a word set;
analyzing the word set to respectively obtain a phonetic symbol set, a part of speech tagging set and a dependency syntax tree, and calculating similarity values of words phonetic symbols in the phonetic symbol set, sentence patterns in the sentence set and grammars in the sentence set and corresponding target knowledge points; the target knowledge points corresponding to the word phonetic symbols are target phonetic symbols, the target knowledge points corresponding to the sentence patterns are target sentence patterns, and the target knowledge points corresponding to the grammars are target grammars;
and when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold value, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point into the basic knowledge point label set.
5. The method of claim 4, wherein the target knowledge point comprises: a target content vocabulary;
the method comprises the steps of carrying out sentence segmentation processing on text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set;
analyzing the word block set and the word set to obtain a reference word set, and calculating similarity values of all reference words in the reference word set and corresponding target knowledge points; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematic vocabulary, wherein a target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, a target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, a target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and a target knowledge point corresponding to the reference mathematic vocabulary is a target mathematic vocabulary;
when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out blocking processing on the sentence set to obtain a word block set;
calculating the importance degree weight of each word block in the word block set based on a key word extraction TF-IDF algorithm;
taking the word block with the importance degree weight larger than a first preset weight as the reference content vocabulary;
calculating similarity values of the reference content vocabularies and the target content vocabularies;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target content vocabulary;
and adding the basic knowledge point label associated with the target content vocabulary into the basic knowledge point label set.
6. The method of claim 4, wherein the target knowledge point comprises: target high-frequency words;
the method comprises the steps of carrying out sentence segmentation processing on text data to obtain a sentence set, respectively carrying out blocking processing on the sentence set to obtain a word block set, and carrying out word segmentation processing to obtain a word set;
analyzing the word block set and the word set to obtain a reference word set, and calculating similarity values of all reference words in the reference word set and corresponding target knowledge points; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematic vocabulary, wherein a target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, a target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, a target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and a target knowledge point corresponding to the reference mathematic vocabulary is a target mathematic vocabulary;
when the similarity value of the respective corresponding target knowledge point is greater than the respective corresponding similarity threshold, adding the basic knowledge point label corresponding to the respective corresponding target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out blocking processing on the sentence set to obtain a word block set;
calculating the importance weight of each word block in the word block set based on a keyword-extraction TF-IDF algorithm;
taking the word blocks whose importance weight is less than or equal to a second preset weight as the reference high-frequency vocabulary;
calculating similarity values of the reference high-frequency vocabulary and the target high-frequency vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target high-frequency vocabulary;
and adding the basic knowledge point label associated with the target high-frequency vocabulary into the basic knowledge point label set.
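The high-frequency branch in claim 6 only changes the selection rule relative to claim 5: word blocks whose TF-IDF weight falls at or below a second, lower threshold (i.e. very common, low-information blocks) become the reference high-frequency vocabulary, and they are matched against the target high-frequency vocabulary exactly as in the sketch above. A minimal sketch of just that selection step, under the same assumptions and with an illustrative threshold value:

```python
# Hypothetical selection of the reference high-frequency vocabulary from the TF-IDF
# weight dictionary ("weights") computed in the previous sketch.
def select_high_frequency(weights, second_preset_weight=0.05):
    return [block for block, weight in weights.items() if weight <= second_preset_weight]
```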
7. The method of claim 4, wherein the target knowledge point comprises: a target verb vocabulary;
the method comprises the steps of carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out blocking processing and word segmentation processing on the sentence set, respectively, to obtain a word block set and a word set;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary;
and when the similarity value for a target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
taking the words whose part of speech is a verb part of speech as the reference verb vocabulary;
calculating a similarity value of the reference verb vocabulary and the target verb vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target verb vocabulary;
and adding the basic knowledge point label associated with the target verb vocabulary into the basic knowledge point label set.
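A minimal sketch of the verb branch in claim 7, assuming English text and NLTK's stock tokenizer and part-of-speech tagger (which require a one-time download of the 'punkt' and 'averaged_perceptron_tagger' resources); the Penn Treebank 'VB*' tags stand in for the claim's "verb part of speech", and the similarity measure, threshold and mapping name are again illustrative assumptions.

```python
import nltk
from difflib import SequenceMatcher

def verb_vocab_labels(text, target_verb_to_label, sim_threshold=0.8):
    words = nltk.word_tokenize(text)                      # word segmentation
    tagged = nltk.pos_tag(words)                          # part-of-speech tagging
    # Reference verb vocabulary: tokens whose tag is any verb tag (VB, VBD, VBG, ...).
    reference_verbs = [word for word, tag in tagged if tag.startswith('VB')]
    labels = set()
    for reference in reference_verbs:
        for target, label in target_verb_to_label.items():
            if SequenceMatcher(None, reference.lower(), target.lower()).ratio() > sim_threshold:
                labels.add(label)                         # label tied to the matched target verb
    return labels
```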
8. The method of claim 4, wherein the target knowledge point comprises: a target mathematical vocabulary;
the method comprises the steps of carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out blocking processing and word segmentation processing on the sentence set, respectively, to obtain a word block set and a word set;
analyzing the word block set and the word set to obtain a reference vocabulary set, and calculating a similarity value between each reference vocabulary in the reference vocabulary set and its corresponding target knowledge point; the reference vocabulary set comprises a reference content vocabulary, a reference high-frequency vocabulary, a reference verb vocabulary and a reference mathematical vocabulary, wherein the target knowledge point corresponding to the reference content vocabulary is a target content vocabulary, the target knowledge point corresponding to the reference high-frequency vocabulary is a target high-frequency vocabulary, the target knowledge point corresponding to the reference verb vocabulary is a target verb vocabulary, and the target knowledge point corresponding to the reference mathematical vocabulary is a target mathematical vocabulary;
and when the similarity value for a target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
taking the words whose part of speech is a numeral part of speech as the reference mathematical vocabulary;
calculating the similarity value of the reference mathematical vocabulary and the target mathematical vocabulary;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target mathematical vocabulary;
and adding the basic knowledge point label associated with the target mathematical vocabulary into the basic knowledge point label set.
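Claim 8 differs from claim 7 only in the part of speech it keeps: under the same NLTK assumptions as the previous sketch, tokens tagged 'CD' (cardinal number) would serve as the reference mathematical vocabulary before the same similarity matching is applied.

```python
# Hypothetical selection of the reference mathematical vocabulary from an NLTK
# part-of-speech tagging result such as the 'tagged' list in the previous sketch.
def select_math_vocab(tagged_words):
    return [word for word, tag in tagged_words if tag == 'CD']
```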
9. The method of claim 4, wherein the target knowledge point comprises: a target phonetic symbol;
the method comprises the steps of carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, respectively, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammars in the sentence set, and their corresponding target knowledge points; the target knowledge point corresponding to a word phonetic symbol is a target phonetic symbol, the target knowledge point corresponding to a sentence pattern is a target sentence pattern, and the target knowledge point corresponding to a grammar is a target grammar;
and when the similarity value for a target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
analyzing each word in the word set and annotating each word with its phonetic symbols to obtain a phonetic symbol set;
calculating similarity values of the word phonetic symbols in the phonetic symbol set and the target phonetic symbols;
when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target phonetic symbol;
and adding the basic knowledge point label associated with the target phonetic symbol into the basic knowledge point label set.
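A minimal sketch of the phonetic-symbol branch in claim 9, using the CMU Pronouncing Dictionary (NLTK corpus 'cmudict', ARPABET phones, one-time download required) purely as a stand-in for the patent's unspecified phonetic-symbol source; exact phone membership replaces the claim's unspecified similarity computation, and all names are illustrative.

```python
from nltk.corpus import cmudict   # requires a one-time download of the 'cmudict' corpus

def phonetic_labels(words, target_phone_to_label):
    pronunciations = cmudict.dict()                        # word -> list of ARPABET phone sequences
    labels = set()
    for word in words:
        for phones in pronunciations.get(word.lower(), []):
            stripped = {p.rstrip('012') for p in phones}   # drop stress digits, e.g. 'AH0' -> 'AH'
            for target_phone, label in target_phone_to_label.items():
                if target_phone in stripped:
                    labels.add(label)                      # label tied to the matched target phonetic symbol
    return labels
```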
10. The method of claim 4, wherein the target knowledge point comprises: a target sentence pattern;
the method comprises the steps of carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, respectively, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammars in the sentence set, and their corresponding target knowledge points; the target knowledge point corresponding to a word phonetic symbol is a target phonetic symbol, the target knowledge point corresponding to a sentence pattern is a target sentence pattern, and the target knowledge point corresponding to a grammar is a target grammar;
and when the similarity value for a target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
carrying out dependency syntax analysis on the sentences in the sentence set to obtain a dependency syntax tree;
calculating similarity values between the words in the word set, the parts of speech in the part-of-speech tagging set and the dependency syntax tree, and the words, the parts of speech and the syntax tree of the target sentence pattern, respectively;
when the similarity value is larger than the similarity threshold value, acquiring a basic knowledge point label associated with the target sentence pattern;
and adding the basic knowledge point label associated with the target sentence pattern into the basic knowledge point label set.
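A minimal sketch of the sentence-pattern branch in claim 10, assuming spaCy with its small English model ('en_core_web_sm', installed separately) for part-of-speech tagging and dependency parsing; flattening a sentence's dependency tree into a string of dependency labels and comparing it to an equally flattened target pattern with difflib is only one plausible reading of the claim's pattern similarity, and the names and threshold are assumptions.

```python
import spacy
from difflib import SequenceMatcher

nlp = spacy.load("en_core_web_sm")    # tokenizer + POS tagger + dependency parser

def sentence_pattern_labels(sentences, target_pattern_to_label, sim_threshold=0.8):
    labels = set()
    for sentence in sentences:
        doc = nlp(sentence)
        dep_pattern = " ".join(token.dep_ for token in doc)   # flattened dependency-syntax pattern
        for target_pattern, label in target_pattern_to_label.items():
            if SequenceMatcher(None, dep_pattern, target_pattern).ratio() > sim_threshold:
                labels.add(label)                             # label tied to the matched target sentence pattern
    return labels
```

For instance, a hypothetical target pattern "nsubj ROOT dobj punct" could be associated with a "simple subject-verb-object sentence" label.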
11. The method of claim 4, wherein the target knowledge point comprises: a target grammar;
the method comprises the steps of carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
analyzing the word set to obtain a phonetic symbol set, a part-of-speech tagging set and a dependency syntax tree, respectively, and calculating similarity values between the word phonetic symbols in the phonetic symbol set, the sentence patterns in the sentence set and the grammars in the sentence set, and their corresponding target knowledge points; the target knowledge point corresponding to a word phonetic symbol is a target phonetic symbol, the target knowledge point corresponding to a sentence pattern is a target sentence pattern, and the target knowledge point corresponding to a grammar is a target grammar;
and when the similarity value for a target knowledge point is greater than the corresponding similarity threshold, adding the basic knowledge point label associated with that target knowledge point to the basic knowledge point label set, including:
carrying out sentence segmentation processing on the learning text data to obtain a sentence set, and carrying out word segmentation processing on the sentence set to obtain a word set;
performing part-of-speech tagging on each word in the word set to obtain a part-of-speech tagging set;
calculating a similarity value between the grammars contained in the sentence set and the target grammar based on the word set and the part-of-speech tagging set;
when the similarity value is larger than a similarity threshold value, acquiring a basic knowledge point label associated with the target grammar;
and adding the basic knowledge point label associated with the target grammar into the basic knowledge point label set.
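A minimal sketch of the grammar branch in claim 11, replacing the patent's unspecified grammar-similarity computation with a hand-written mapping from regular expressions over part-of-speech tag sequences to grammar labels; the example present-continuous pattern in the comment is an assumption, not something the patent defines.

```python
import re

def grammar_labels(tagged_words, grammar_pattern_to_label):
    tag_sequence = " ".join(tag for _, tag in tagged_words)   # e.g. "PRP VBZ VBG NN"
    labels = set()
    for pattern, label in grammar_pattern_to_label.items():   # e.g. {r"VB[ZP] VBG": "present continuous"}
        if re.search(pattern, tag_sequence):
            labels.add(label)                                  # label tied to the detected grammar point
    return labels
```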
12. An apparatus for labeling data resources, the apparatus comprising:
the preprocessing module is used for preprocessing the original data resource to obtain text data;
the calculation module is used for carrying out similarity calculation on the text data and the target knowledge points respectively to obtain similarity values, wherein each target knowledge point is associated with a basic knowledge point label;
the first processing module is used for generating a basic knowledge point label set of the original data resource according to a comparison result of the similarity values and the similarity thresholds, wherein the basic knowledge point labels included in the basic knowledge point label set are the basic knowledge point labels associated with the target knowledge points whose similarity values are greater than the corresponding similarity threshold;
and the second processing module is used for generating a comprehensive knowledge point label set of the original data resource according to the characteristic information of the original data resource and the basic knowledge point label set.
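To show how the four modules of claim 12 could fit together, here is a minimal sketch of the apparatus as a plain Python class; the module callables, their signatures and the class name are assumptions used only to make the data flow explicit.

```python
class DataResourceLabelingApparatus:
    def __init__(self, preprocess, compute_similarities, build_basic_labels, build_comprehensive_labels):
        self.preprocess = preprocess                                   # preprocessing module
        self.compute_similarities = compute_similarities               # calculation module
        self.build_basic_labels = build_basic_labels                   # first processing module
        self.build_comprehensive_labels = build_comprehensive_labels   # second processing module

    def label(self, raw_resource, target_knowledge_points, feature_info):
        text = self.preprocess(raw_resource)                           # raw resource -> text data
        similarities = self.compute_similarities(text, target_knowledge_points)
        basic_label_set = self.build_basic_labels(similarities)        # threshold comparison happens here
        return self.build_comprehensive_labels(feature_info, basic_label_set)
```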
13. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 11.
14. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 11.
CN202010580828.4A 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment Active CN111930792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010580828.4A CN111930792B (en) 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111930792A true CN111930792A (en) 2020-11-13
CN111930792B CN111930792B (en) 2024-04-12

Family

ID=73316724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010580828.4A Active CN111930792B (en) 2020-06-23 2020-06-23 Labeling method and device for data resources, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111930792B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090955A (en) * 2014-07-07 2014-10-08 科大讯飞股份有限公司 Automatic audio/video label labeling method and system
US20170103074A1 (en) * 2015-10-09 2017-04-13 Fujitsu Limited Generating descriptive topic labels
CN105956144A (en) * 2016-05-13 2016-09-21 安徽教育网络出版有限公司 Method for quantitatively calculating association degree among multi-tab learning resources
CN110162591A (en) * 2019-05-22 2019-08-23 南京邮电大学 A kind of entity alignment schemes and system towards digital education resource

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUO Chonghui; LYU Zhengda: "A Multi-Knowledge-Point Annotation Method for Test Questions Based on Ensemble Learning", Operations Research and Management Science, no. 02, 25 February 2020 (2020-02-25) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836013A (en) * 2021-01-29 2021-05-25 北京大米科技有限公司 Data labeling method and device, readable storage medium and electronic equipment
CN113569007A (en) * 2021-06-18 2021-10-29 武汉理工数字传播工程有限公司 Method, device and storage medium for processing knowledge service resources
CN113569007B (en) * 2021-06-18 2024-06-21 武汉理工数字传播工程有限公司 Method, device and storage medium for processing knowledge service resources
CN113536777A (en) * 2021-07-30 2021-10-22 深圳豹耳科技有限公司 Extraction method, device and equipment of news keywords and storage medium
CN114676774A (en) * 2022-03-25 2022-06-28 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium
CN114492419A (en) * 2022-04-01 2022-05-13 杭州费尔斯通科技有限公司 Text labeling method, system and device based on newly added key words in labeling
CN114492419B (en) * 2022-04-01 2022-08-23 杭州费尔斯通科技有限公司 Text labeling method, system and device based on newly added key words in labeling
CN116029284A (en) * 2023-03-27 2023-04-28 上海蜜度信息技术有限公司 Chinese substring extraction method, chinese substring extraction system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111930792B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN107066449B (en) Information pushing method and device
CN109657054B (en) Abstract generation method, device, server and storage medium
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
US8386240B2 (en) Domain dictionary creation by detection of new topic words using divergence value comparison
WO2021218028A1 (en) Artificial intelligence-based interview content refining method, apparatus and device, and medium
CN110019742B (en) Method and device for processing information
CN111680159A (en) Data processing method and device and electronic equipment
CN112084334B (en) Label classification method and device for corpus, computer equipment and storage medium
CN108121699B (en) Method and apparatus for outputting information
CN112686022A (en) Method and device for detecting illegal corpus, computer equipment and storage medium
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN113282701B (en) Composition material generation method and device, electronic equipment and readable storage medium
CN114528845A (en) Abnormal log analysis method and device and electronic equipment
CN112188312A (en) Method and apparatus for determining video material of news
CN111414561A (en) Method and apparatus for presenting information
CN115099239B (en) Resource identification method, device, equipment and storage medium
WO2023005968A1 (en) Text category recognition method and apparatus, and electronic device and storage medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN111813993A (en) Video content expanding method and device, terminal equipment and storage medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN117077679B (en) Named entity recognition method and device
CN114491034A (en) Text classification method and intelligent device
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN111354354A (en) Training method and device based on semantic recognition and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant