CN112732867A - File processing method and device - Google Patents

Info

Publication number
CN112732867A
Authority
CN
China
Prior art keywords
resource
file
files
cluster
newly added
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011602808.9A
Other languages
Chinese (zh)
Other versions
CN112732867B (en)
Inventor
陈静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN202011602808.9A priority Critical patent/CN112732867B/en
Publication of CN112732867A publication Critical patent/CN112732867A/en
Application granted granted Critical
Publication of CN112732867B publication Critical patent/CN112732867B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a file processing method and device. The method comprises the following steps: acquiring a plurality of resource files and constructing characteristic information of each resource file; clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; and, according to a received resource file extraction request, extracting resource files from at least one resource cluster to form a file package, and returning the file package. The method solves the technical problem in the prior art that a teacher cannot accurately find suitable teaching resources because teaching resource recommendation methods recommend only a single type of resource.

Description

File processing method and device
Technical Field
The invention relates to the field of data processing, in particular to a file processing method and device.
Background
With the popularization of online education, electronic resources have become increasingly abundant. This massive growth in resources widens teachers' choices and brings more possibilities to teaching: teachers can use a variety of resources to enrich classroom content and enliven the classroom atmosphere. However, when facing a large amount of electronic resources, a teacher often has difficulty selecting the desired resources quickly and accurately. To improve the efficiency of resource matching, the prior art often recommends a single type of resource, for example exercise recommendation. However, a complete teaching process requires a combination of multiple types of resources covering everything the teacher needs when teaching a specific piece of content. For example, a lesson requires courseware to be prepared, in-class or after-class exercises to be assigned, and a demonstration animation or a knowledge point explanation video to consolidate the students' knowledge or raise their interest.
For the problem in the prior art that a teacher cannot accurately find suitable teaching resources because teaching resource recommendation methods recommend only a single type of resource, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a file processing method and device, and at least solves the technical problem that a teacher cannot accurately find a proper teaching resource due to a single teaching resource recommendation method in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a file processing method, including: acquiring a plurality of resource files and constructing characteristic information of each resource file; clustering a plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; and extracting the resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
Further, acquiring a plurality of resource files and constructing characteristic information of each resource file, including: acquiring text information in a resource file, and segmenting the text information; cleaning the word segmentation result by using the stop word list; and performing text vectorization processing based on the cleaning result to obtain a text vector for representing the characteristic information.
Further, in the case that the resource file is a video file, acquiring text information in the resource file includes: under the condition that the video file comprises subtitle data, acquiring the subtitle data to obtain text information in the video file; in the case where the video file does not include subtitle data, voice information in the video file is extracted and converted into text information.
Further, the method further comprises: creating a stop word list corresponding to the file type of the resource file, wherein creating the stop word list corresponding to the file type of the resource file comprises: performing word segmentation on the full set of resource files in a resource library, wherein the resource library comprises a plurality of types of resource files; screening out the stop words corresponding to each type of resource file from the word segmentation results of the full set of resource files, wherein the stop words corresponding to each type of resource file are determined according to the occurrence frequency of each word in that type of resource file; and generating the stop word list corresponding to the file type according to the stop words corresponding to each type of resource file.
Further, cleaning the word segmentation result by using the stop word list comprises: cleaning the word segmentation result by using the stop word list corresponding to the file type of the resource file.
Further, after performing text vectorization processing based on the cleaning result to obtain a text vector for representing the characteristic information, the method further includes one or more of: scaling the text vector through an activation function; and performing dimension reduction processing on the text vector.
Further, clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters, including: and clustering the plurality of resource files based on the characteristic information of each resource file through a K-means clustering algorithm to generate a plurality of resource clusters.
Further, after clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters, the method further comprises: receiving a newly added resource file and constructing the characteristic information of the newly added resource file; determining the neighbor files of the newly added resource file according to the characteristic information of the newly added resource file and the characteristic information of the existing resource files; and, according to the distance relationship between the newly added resource file and the neighbor files, assigning the newly added resource file to one of the resource clusters or generating a new resource cluster for the newly added resource file.
Further, in the case that the neighbor files all belong to the same first target resource cluster, assigning the newly added resource file to one of the resource clusters or generating a new resource cluster for it according to the distance relationship between the newly added resource file and the neighbor files comprises: acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster; acquiring a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; acquiring the average distance between all resource files in the first target resource cluster and the centroid; in the case that the difference between the first distance and the second distance is smaller than or equal to the average distance, assigning the newly added resource file to the first target resource cluster; and in the case that the difference between the first distance and the second distance is larger than the average distance, generating a new resource cluster for the newly added resource file.
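The single-cluster rule described above can be illustrated with a minimal Python sketch. It assumes Euclidean distance and a centroid computed as the mean of the existing cluster members; the function names are hypothetical and are not taken from the patent.

```python
import math


def centroid(points):
    """Mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def joins_first_target_cluster(new_vec, cluster_vecs):
    """Return True to join the cluster, False to start a new cluster."""
    c = centroid(cluster_vecs)
    member_dists = [math.dist(v, c) for v in cluster_vecs]
    first = math.dist(new_vec, c)                      # new file to centroid
    second = max(member_dists)                         # farthest member to centroid
    average = sum(member_dists) / len(member_dists)    # mean member distance
    return (first - second) <= average
```

For a cluster whose members all lie about 1.41 from the centroid, a new file at distance 2.12 joins the cluster (difference 0.71), while one at distance 12.7 starts a new cluster.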
Further, in the case that the neighbor files do not all belong to the same resource cluster, assigning the newly added resource file to one of the resource clusters or generating a new resource cluster for it according to the distance relationship between the newly added resource file and the neighbor files comprises: acquiring the proportion of the neighbor files that belong to each resource cluster; in the case that there exist a second target resource cluster and a third target resource cluster whose proportions differ by less than a first preset value, acquiring a first average value of the distances between the newly added resource file and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource file and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; if the four average values meet a preset condition, adding the newly added resource file and the neighbor files in the second target resource cluster to the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset condition is that the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; and if the four average values do not meet the preset condition, acquiring the resource cluster whose centroid is closest to the newly added resource file and adding the newly added resource file to that resource cluster, or generating a new resource cluster for the newly added resource file.
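The mixed-cluster rule can be sketched in the same way. The sketch below is only one possible reading. It assumes Euclidean distance, takes the two clusters with the most neighbor files as the candidate pair, and treats the average pairwise distance of a cluster with a single neighbor as 0. The parameters `ratio_gap` and `dist_gap` stand in for the first and second preset values, and all names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def decide_mixed(new_vec, neighbors, centroids, ratio_gap=0.2, dist_gap=0.5):
    """neighbors: list of (vector, cluster_id); centroids: cluster_id -> vector.

    Returns ("merge", source_cluster, target_cluster) or ("nearest", cluster_id).
    """
    ranked = Counter(cid for _, cid in neighbors).most_common()
    if len(ranked) >= 2:
        (third, n3), (second, n2) = ranked[0], ranked[1]   # third has the higher share
        if (n3 - n2) / len(neighbors) < ratio_gap:
            vecs2 = [v for v, cid in neighbors if cid == second]
            vecs3 = [v for v, cid in neighbors if cid == third]
            a1 = mean(math.dist(new_vec, v) for v in vecs2)
            a2 = mean(math.dist(new_vec, v) for v in vecs3)
            a3 = mean(math.dist(p, q) for p, q in combinations(vecs2, 2))
            a4 = mean(math.dist(p, q) for p, q in combinations(vecs3, 2))
            if abs(a1 - a2) < dist_gap and a1 < a3 and a2 < a4:
                return ("merge", second, third)
    # Fallback: the cluster whose centroid is closest to the new file.
    nearest = min(centroids, key=lambda cid: math.dist(new_vec, centroids[cid]))
    return ("nearest", nearest)
```

A "merge" result means the new file and the neighbor files of the source cluster would move into the target cluster. Whether the fallback joins the nearest cluster or creates a new one is left to the caller, since the text allows either.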
Further, each resource file has a corresponding file grade and each resource cluster has a corresponding theme, and extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package, includes: extracting the request information from the resource file extraction request, wherein the request information comprises at least one of the following items: an extraction theme, an extraction file grade, and an extraction quantity corresponding to each file type; in the case that the resource file extraction request comprises the extraction theme and the extraction file grade, screening out the resource files that meet the extraction file grade from the resource cluster whose theme is the same as the extraction theme; and in the case that the resource file extraction request comprises the extraction quantity, randomly extracting the corresponding quantity of resource files from the resource files that meet the extraction file grade to form a file package, and returning the file package.
Further, in the case that the resource file extraction request does not include the extraction quantity, the extraction quantity is determined according to the historical extraction behavior of the requester, or the resource files are extracted according to a preset extraction quantity; in the case that the resource file extraction request does not include the extraction theme, resource files are extracted from the first N resource clusters when the resource clusters are sorted by number of resource files from high to low; and in the case that the resource file extraction request does not include the extraction file grade, the corresponding quantity of resource files is randomly extracted from the resource cluster whose theme is the same as the extraction theme to form a file package.
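The extraction logic and its fallbacks can be sketched as follows. This is a minimal illustration, not the patented implementation. The dictionary layout of clusters, files, and requests, the constants `DEFAULT_PER_TYPE` and `TOP_N`, and the use of exact equality to decide that a file "meets" a grade are all assumptions. The history-based quantity is omitted in favor of the preset quantity.

```python
import random

DEFAULT_PER_TYPE = 1   # hypothetical preset extraction quantity per file type
TOP_N = 2              # hypothetical N for the "first N clusters" fallback


def build_package(clusters, request, rng=None):
    """clusters: theme -> list of {"name", "type", "grade"} dicts."""
    rng = rng or random.Random()
    theme = request.get("theme")
    if theme is not None:
        themes = [theme] if theme in clusters else []
    else:
        # No theme given: fall back to the N largest clusters.
        themes = sorted(clusters, key=lambda t: len(clusters[t]), reverse=True)[:TOP_N]
    grade = request.get("grade")
    quantity = request.get("quantity")
    package = []
    for t in themes:
        # No grade given: every file in the cluster is a candidate.
        pool = [f for f in clusters[t] if grade is None or f["grade"] == grade]
        per_type = quantity if quantity is not None else {
            f["type"]: DEFAULT_PER_TYPE for f in pool}
        for ftype, n in per_type.items():
            candidates = [f for f in pool if f["type"] == ftype]
            package.extend(rng.sample(candidates, min(n, len(candidates))))
    return package
```

Passing a seeded `random.Random` makes the random extraction reproducible, which is convenient for testing.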
According to another aspect of the embodiments of the present invention, there is also provided a file processing apparatus, including: the acquisition module is used for acquiring a plurality of resource files and constructing the characteristic information of each resource file; the clustering module is used for clustering the resource files based on the characteristic information of each resource file to generate a plurality of resource clusters; and the composition module is used for extracting the resource files from at least one resource cluster to form a file package according to the received resource file extraction request and returning the file package.
According to another aspect of embodiments of the present invention, there is also provided a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any one of the above.
According to another aspect of the embodiments of the present invention, there is also provided an intelligent interactive tablet, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of the above.
In the embodiment of the invention, the characteristic information of the resource files is constructed, the resource clusters are obtained by clustering the plurality of resource files, and a combination of multiple resources is obtained from the resource clusters according to the resource file extraction request to form the file package. This resource file combination method can be used to generate lesson preparation packages for teaching. By constructing text features of resource files in the education field and clustering them accordingly, combinations of multiple resources that are similar in content and suitable for use together can be generated. This helps teachers quickly build lesson preparation packages that meet their requirements and reduces the time teachers spend searching for resources and matching different types of resources. It thereby solves the technical problem in the prior art that a teacher cannot accurately find suitable teaching resources because teaching resource recommendation methods recommend only a single type of resource, and improves teaching efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of processing a file according to an embodiment of the invention;
FIG. 2 is a flow chart of an alternative method of processing files in accordance with an embodiment of the present invention;
FIG. 3 is a flow chart of an alternative method for constructing characteristic information of a resource file according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a file processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an alternative intelligent interactive tablet according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the implementation of the invention, a combination of multiple types of resources is defined as a lesson preparation package; the resources in the package are related to one another and consistent in content. In the prior art, the tags of the resources themselves are usually used to associate resources, for example by associating the exercises, courseware, videos, and the like under the same chapter. However, even within the same chapter, the content of different resources may differ; for example, resources whose examples use different scenarios are not suitable for use together. The prior-art association method is therefore inaccurate, and the teacher cannot accurately obtain the expected lesson preparation package.
Example 1
In accordance with an embodiment of the present invention, an embodiment of a file processing method is provided. It is noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
FIG. 1 is a flowchart of a file processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S101, a plurality of resource files are obtained, and characteristic information of each resource file is constructed.
The plurality of resource files are resources of the same type or of different types that are associated with the user's needs. In an alternative embodiment of generating the lesson preparation package, the plurality of resource files may include, but are not limited to, exercises, courseware, videos, and other multimedia resource files. Specifically, the exercise information may include: the text content of the exercise (stem, options, answer), difficulty, associated chapter, etc.; the courseware information may include: courseware content (including the text of each page), associated chapter, etc.; and the video information may include: video content (including the subtitles or audio of each frame), associated chapter, and the like.
In order to characterize different types of resources, feature extraction needs to be performed on them; the characteristic information includes, but is not limited to, vectorized text features. For example, in the embodiment of generating the lesson preparation package, exercises, courseware, and videos are all rich in text information, so their text features need to be extracted. For video resources without subtitles, the audio in the video is first converted into text and the text features are then extracted. The different types of resources are represented by the resulting text features.
Step S102, clustering a plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters.
A resource cluster can be understood as a combination of resources with similar content. After the vectorized characteristic information of each resource file is obtained, the vectors can be clustered to obtain combinations of resources with similar content. For example, in the above embodiment of generating a lesson preparation package, if the clustering range is limited to a chapter, the resource files in the same chapter are clustered according to their text features, and resources with similar content are grouped together to obtain a plurality of resource clusters, where each resource cluster contains different types of resource files such as exercises, courseware, and videos.
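As an illustration of the clustering step, the following is a minimal K-means sketch in plain Python. It assumes Euclidean distance, random initial centroids drawn from the data, and a fixed iteration count; the function name and parameters are hypothetical, and a production system would more likely use an existing library implementation.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Each returned label is the index of the resource cluster a file belongs to, so files of different types (exercises, courseware, videos) with similar text vectors end up sharing a label.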
Step S103, extracting resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
A file package can be understood as the combination of resource files that the user expects to obtain. The resource file extraction request is entered by the user and may include keywords for the resource files, the number of resources of each type, and so on. The keywords in the request are matched against the keywords of the resource clusters, at least one matching resource cluster is returned, resource files are extracted from each returned cluster, and the resource files of different types extracted from the clusters are combined into a file package. In the embodiment of generating the lesson preparation package, the teacher may supply resource content keywords, the number of resources of each type, and the exercise difficulty level as the content of the request; the most relevant resource clusters are returned by matching the teacher's keywords against the cluster keywords, the specified number of exercises, courseware, and videos are randomly selected from each cluster, and the lesson preparation package is finally generated.
In an alternative embodiment of generating a lesson preparation package for teaching, resource information including exercises, courseware, and videos is first obtained, where the exercise information includes the text content of the exercise (stem, options, and answer), difficulty, and associated chapter; the courseware information includes courseware content (including the text of each page) and associated chapter; and the video information includes video content (including the subtitles or audio of each frame), associated chapter, and the like. Text features of the resource information are constructed (that is, the keywords of the exercises, courseware, and videos are extracted), and the text features are vectorized to obtain text vectors. Once the text vectors are obtained, they can be used for clustering, so that resource files with similar content are gathered in the same resource cluster. For example, when the clustering range is set to a chapter, the resources under the same chapter can be selected by their chapter label and clustered together. The teacher then enters a resource file extraction request and a lesson preparation package is generated. For example, if the teacher requests video files and exercises about a certain theme in a certain chapter, the corresponding number of video files and exercises are selected from the resource cluster for that theme in the chapter to generate the lesson preparation package.
In this embodiment, the characteristic information of the resource files is constructed, the resource clusters are obtained by clustering the plurality of resource files, and a combination of multiple resources is obtained from the resource clusters according to the resource file extraction request to form the file package. This resource file combination method can be used to generate lesson preparation packages for teaching. By constructing text features of resource files in the education field and clustering them accordingly, combinations of multiple resources that are similar in content and suitable for use together can be generated. This helps teachers quickly build lesson preparation packages that meet their requirements and reduces the time teachers spend searching for resources and matching different types of resources. It thereby solves the technical problem in the prior art that a teacher cannot accurately find suitable teaching resources because teaching resource recommendation methods recommend only a single type of resource, and improves teaching efficiency.
As an alternative embodiment, acquiring a plurality of resource files and constructing feature information of each resource file includes: acquiring text information in a resource file, and segmenting the text information; cleaning the word segmentation result by using the stop word list; and performing text vectorization processing based on the cleaning result to obtain a text vector for representing the characteristic information.
Text features can be obtained through a bag-of-words model or a word embedding model, both of which construct text features at word granularity, so the text must first be segmented and cleaned. Word segmentation can be understood as splitting the text information into meaningful word units, and cleaning can be understood as filtering the stop words out of the resulting word units. For example, for the phrase "side length of an equilateral triangle", word segmentation yields three word units: "equilateral triangle", "of", and "side length". Here "of" is a function word with no actual meaning (that is, a stop word), so after the word segmentation result is cleaned, the two word units "equilateral triangle" and "side length" are retained and vectorized.
In an alternative embodiment, the cleaned text may be vectorized using TF-IDF (Term Frequency-Inverse Document Frequency) or Word2vec (a word-to-vector model).
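The segmentation and cleaning steps can be sketched as follows. The regular-expression tokenizer is only a stand-in for English text; segmenting Chinese text would require a dedicated word segmenter, which is not shown. The function names and the sample stop word list are hypothetical.

```python
import re


def segment(text):
    """Naive word segmentation: lowercase alphanumeric runs (stand-in only)."""
    return re.findall(r"\w+", text.lower())


def clean_tokens(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]
```

For instance, `clean_tokens(segment("The side length of an equilateral triangle"), {"the", "of", "an"})` keeps only the content words `side`, `length`, `equilateral`, and `triangle`.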
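To make the TF-IDF option concrete, here is a minimal pure-Python sketch. It uses relative term frequency and the variant `idf = ln(N / df) + 1`, which is one of several common TF-IDF formulations and is an assumption rather than the formula used in the patent. The function name is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of cleaned token lists. Returns (vocabulary, vectors)."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        total = len(d) or 1
        vectors.append([tf[t] / total * idf[t] for t in vocab])
    return vocab, vectors
```

Terms that appear in every document receive the lowest weight, so words shared by all resources contribute little to the distance between text vectors.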
As an optional embodiment, in the case that the resource file is a video file, acquiring text information in the resource file includes: under the condition that the video file comprises subtitle data, acquiring the subtitle data to obtain text information in the video file; in the case where the video file does not include subtitle data, voice information in the video file is extracted and converted into text information.
It should be noted that, when the characteristic information of a resource file is represented by a text vector, every type of resource needs to be converted into a text vector. For video resources without subtitles, the voice information in the video needs to be converted into text before the vectorized text features are extracted; the different types of resources are then represented by the resulting text features.
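The subtitle-or-speech branch can be sketched as below. The dictionary layout of a video record and the `speech_to_text` callable are hypothetical placeholders; the patent does not name a speech recognition component, so the sketch takes it as a parameter.

```python
def extract_video_text(video, speech_to_text):
    """Return the text of a video record.

    video: dict with an optional "subtitles" list and an "audio" payload.
    speech_to_text: caller-supplied function mapping audio to text.
    """
    subtitles = video.get("subtitles")
    if subtitles:
        # Subtitle data is present: use it directly as the text information.
        return " ".join(subtitles)
    # No subtitle data: fall back to converting the voice information.
    return speech_to_text(video["audio"])
```

The returned text can then go through the same segmentation, cleaning, and vectorization steps as exercises and courseware.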
As an alternative embodiment, the method further includes: creating a deactivation word list corresponding to the file type of the resource file, wherein creating the deactivation word list corresponding to the file type of the resource file comprises: performing word segmentation on a total number of resource files in a resource library, wherein the resource library comprises a plurality of types of resource files; screening out stop words corresponding to each type of resource files from word segmentation results of the full amount of resource files, wherein the stop words corresponding to each type of resource files are determined according to the occurrence frequency of each stop word in each type of resource files; and generating a deactivation word list corresponding to the file type according to the deactivation words corresponding to each type of resource file.
It should be noted that in the field of education, different types of resource files (for example, questions, courseware, and videos) often contain generic phrases with low information content, such as "the following statement" in questions, "the objective of this lesson" in courseware, and "the reason" in videos. General stop word lists do not cover such phrases, so a dedicated stop word list needs to be constructed for resources in the field of education. Because questions, courseware, and videos differ in their style of expression, a separate stop word list needs to be constructed for each type of resource file.
Taking courseware as an example, the text of all courseware in the resource library is extracted and segmented into words. After cleaning with the general stop word list, the frequency of each remaining word is counted, and the words with higher frequency are screened out as new stop words, thereby constructing a dedicated stop word list for educational resources. The same construction method can be applied to the other types of resource files, such as questions and videos.
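A minimal sketch of the per-type stop word list construction is given below. The function name `build_stop_words`, the `top_n` cutoff, and the sample word units are assumptions for illustration; the text only states that words with higher frequency are screened out, without fixing a threshold.

```python
from collections import Counter

def build_stop_words(docs_by_type, general_stop_words, top_n):
    """Return one extended stop word set per file type.

    docs_by_type maps a file type to a list of token lists (segmented texts).
    top_n is an assumed cutoff: the top_n most frequent remaining words of
    each type are treated as new, type-specific stop words.
    """
    result = {}
    for file_type, docs in docs_by_type.items():
        # Count words after cleaning with the general stop word list.
        counts = Counter(
            token
            for doc in docs
            for token in doc
            if token not in general_stop_words
        )
        frequent = {token for token, _ in counts.most_common(top_n)}
        result[file_type] = set(general_stop_words) | frequent
    return result

# Hypothetical segmented texts for two file types.
docs_by_type = {
    "courseware": [
        ["the", "lesson objective", "intersecting lines"],
        ["the", "lesson objective", "parallel lines"],
        ["lesson objective", "angle"],
    ],
    "question": [
        ["which statement", "is", "correct", "triangle"],
        ["which statement", "circle"],
    ],
}
stop_lists = build_stop_words(docs_by_type, {"the", "is"}, top_n=1)
```

Each file type ends up with its own list, so a phrase that is generic in courseware is not automatically removed from questions.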
As an alternative embodiment, cleaning the word segmentation result with the stop word list includes: cleaning the word segmentation result with the stop word list corresponding to the file type of the resource file.
As an optional embodiment, after performing text vectorization processing based on the preprocessing result to obtain a text vector for representing the feature information, the method further includes one or more of: carrying out scaling processing on the text vector through an activation function; and performing dimension reduction processing on the text vector.
Nouns related to knowledge points appear repeatedly in courseware and videos. For example, in courseware that explains intersecting lines, the term "intersecting lines" appears on many different pages, so its count among the word units of the courseware is large; if text features are extracted directly, the weight of this term dominates and the weights of the other words become too low to have much influence. Therefore, after the text vector is extracted, element-wise scaling is performed with an activation function so that each dimension of the text vector is limited to the range 0 to 1. For example, the activation function may be a sigmoid function, whose formula is as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
where x is a text vector.
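The element-wise scaling can be sketched as follows. The function names and the sample vector are assumptions for illustration; the two-branch form is only a numerical precaution against overflow for large negative inputs.

```python
import math

def sigmoid(x):
    """Numerically stable sigmoid of a single value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_scale(vector):
    """Apply the sigmoid function to each component of a text vector."""
    return [sigmoid(x) for x in vector]

# Hypothetical raw feature weights: one dominant term and two small ones.
raw = [0.0, 0.5, 12.0]
scaled = sigmoid_scale(raw)
```

Note that for non-negative inputs such as TF-IDF weights, the outputs fall between 0.5 and 1, so the dominant component is compressed toward 1 while the ordering of the components is preserved.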
In addition, the directly extracted text vectors are high-dimensional and sparse, so the text vectors of the different resource types (e.g., questions, courseware, and videos) can be reduced in dimension in a unified manner. The dimension reduction method can be PCA (Principal Component Analysis), Isomap (Isometric Feature Mapping), t-SNE (t-distributed Stochastic Neighbor Embedding), and the like.
As an optional embodiment, clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters includes: and clustering the plurality of resource files based on the characteristic information of each resource file through a K-means clustering algorithm to generate a plurality of resource clusters.
The K-means clustering algorithm divides samples into different clusters according to the distances between samples, and optimizes the clustering by iteratively updating the cluster centroids. By clustering on the vector features of the resource files, resources with similar content can be grouped together to form a plurality of resource clusters.
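The iteration described above can be sketched in a few lines. This is a plain Lloyd-style K-means under stated assumptions: Euclidean distance, initial centroids supplied by the caller (random sampling or k-means++ would be common choices), and hypothetical two-dimensional sample vectors.

```python
import math

def k_means(points, initial_centroids, max_iterations=100):
    """Cluster points (sequences of floats) around the given initial centroids."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster became empty
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids

# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
labels, centroids = k_means(points, initial_centroids=[points[0], points[-1]])
```

K-means is sensitive to initialization and may settle in a local optimum, so practical implementations often run it several times with different starting centroids.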
As an optional embodiment, after clustering the plurality of resource files based on the characteristic information of each resource file to generate a plurality of resource clusters, the method further includes: receiving a newly added resource file and constructing the characteristic information of the newly added resource file; determining a neighbor file of the newly added resource file according to the characteristic information of the newly added resource file and the characteristic information of the existing resource file; and dividing the newly added resource into any one of the resource clusters or regenerating a resource cluster for the newly added resource according to the distance relationship between the newly added resource file and the adjacent file.
In an education-field database, a large number of new resources (such as newly uploaded questions, courseware, and videos) are added to the repository every day. With an incremental clustering method, newly added resources can be placed into a suitable resource cluster on the basis of the original clusters, where the original clusters may be the resource clusters obtained by the K-means algorithm or newly created clusters. Classifying new resources by incremental clustering improves clustering efficiency and saves time in processing resource data.
As an optional embodiment, in a case that the neighbor files all belong to the same first target resource cluster, dividing the newly added resource into any one of the resource clusters, or regenerating a resource cluster for the newly added resource, according to the distance relationship between the newly added resource file and the neighbor files includes: acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster; acquiring a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; acquiring the average distance between all resource files in the first target resource cluster and the centroid; in a case that the difference between the first distance and the second distance is smaller than or equal to the average distance, dividing the newly added resource into the first target resource cluster; and in a case that the difference between the first distance and the second distance is larger than the average distance, regenerating a resource cluster for the newly added resource.
Specifically, a text vector is extracted for each newly added resource, the distances between the new resource and all clustered resources are calculated, and the k nearest neighbors of the new resource are obtained. If all k neighbors belong to a first target resource cluster (i.e., the same cluster), the following are calculated: the distance dist_c between the new resource and the centroid of the cluster (i.e., the first distance), the average distance dist_mean between all samples of the cluster and the centroid, and the distance dist_max between the centroid and the resource file in the cluster farthest from it (i.e., the second distance). Whether the new resource is assigned to the first target resource cluster or to a new cluster is then determined by the following two conditions:
a) if dist_c - dist_max <= dist_mean, i.e., the new resource is not too far from the centroid, the new resource is assigned to the first target resource cluster;
b) if dist_c - dist_max > dist_mean, i.e., the new resource is too far from the centroid, a new cluster is created for the new resource alone.
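The neighbor lookup and the two conditions above can be sketched as follows. Euclidean distance, the helper names `k_nearest` and `joins_cluster`, and the sample vectors are assumptions made for this illustration; the centroid is taken as the mean of the cluster members.

```python
import math

def k_nearest(new_vector, vectors, k):
    """Indices of the k existing vectors closest to new_vector."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(new_vector, vectors[i]))
    return order[:k]

def joins_cluster(new_vector, members):
    """Return True if condition a) holds, False if condition b) holds."""
    centroid = [sum(axis) / len(members) for axis in zip(*members)]
    distances = [math.dist(m, centroid) for m in members]
    dist_c = math.dist(new_vector, centroid)      # first distance
    dist_max = max(distances)                     # second distance
    dist_mean = sum(distances) / len(distances)   # average distance
    return dist_c - dist_max <= dist_mean

# Hypothetical clustered vectors and their cluster labels.
vectors = [(0, 0), (2, 0), (0, 2), (2, 2), (20, 20), (22, 20), (20, 22)]
labels = [0, 0, 0, 0, 1, 1, 1]
new_vector = (2.5, 2.5)

neighbors = k_nearest(new_vector, vectors, k=3)
neighbor_clusters = {labels[i] for i in neighbors}
if len(neighbor_clusters) == 1:
    target = neighbor_clusters.pop()
    members = [v for v, label in zip(vectors, labels) if label == target]
    assigned = target if joins_cluster(new_vector, members) else "new cluster"
```

Here the new vector lies just outside cluster 0 but within the allowed margin, so it joins that cluster; a vector such as (10, 10) would instead trigger condition b).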
As an optional embodiment, in a case that the neighbor files do not all belong to the same resource cluster, dividing the newly added resource into any one of the resource clusters, or regenerating a resource cluster for the newly added resource, according to the distance relationship between the newly added resource file and the neighbor files includes: acquiring the proportion of the neighbor files that belongs to each resource cluster; in a case that there exist a second target resource cluster and a third target resource cluster whose proportion difference is smaller than a first preset value, acquiring a first average value of the distances between the newly added resource and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; if the first average value, the second average value, the third average value and the fourth average value meet preset conditions, adding the newly added resource file and the neighbor files in the second target resource cluster into the third target resource cluster, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset conditions include: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; and if the first average value, the second average value, the third average value and the fourth average value do not meet the preset conditions, acquiring the resource cluster to which the centroid with the shortest distance to the newly added resource file belongs, and adding the newly added resource file into the determined resource cluster, or regenerating a resource cluster for the newly added resource.
Specifically, if the k neighbor files do not all belong to the same cluster, the proportion of neighbor files belonging to each resource cluster is calculated. If the neighbor files belong to m resource clusters, the proportions are recorded as $f_1, f_2, \dots, f_m$, where $f_1 + f_2 + \dots + f_m = 1$, and the resource cluster to which the new resource is added is determined according to the following conditions:
If there exist resource clusters i and j, $i \neq j$, with $|f_i - f_j| < f$, where $f_i$ and $f_j$ are the proportions of neighbor files in the second target resource cluster i and the third target resource cluster j and f is the first preset value, then the following are calculated: the average distance d_i_in between the new resource and the neighbor files in resource cluster i (i.e., the first average value), the average distance d_j_in between the new resource and the neighbor files in resource cluster j (i.e., the second average value), the average distance d_i_mean between the neighbor files included in resource cluster i (i.e., the third average value), and the average distance d_j_mean between the neighbor files included in resource cluster j (i.e., the fourth average value). The first preset value can be understood as a threshold on the proportion difference between resource cluster i and resource cluster j: when the difference is smaller than the threshold, the neighbor files are considered to be about equally distributed between resource cluster i and resource cluster j.
a) If the preset conditions |d_i_in - d_j_in| < d, d_i_in < d_i_mean, and d_j_in < d_j_mean are met, where d is the second preset value, resource cluster i and resource cluster j are merged and the new resource is assigned to the merged cluster. The second preset value can be understood as a threshold on the difference between d_j_in and d_i_in: when the difference is smaller than the threshold, the new resource is about equally distant from its neighbor files in resource cluster i and from those in resource cluster j, and both distances are small. As an alternative embodiment, the neighbor files in the resource cluster with the lower proportion may be added, together with the new resource, to the resource cluster with the higher proportion.
b) If the preset conditions are not met, the distances between the new resource and the centroids of the resource clusters containing the neighbor files are calculated, and the resource cluster with the nearest centroid is taken as the candidate cluster. Whether the new resource joins the candidate cluster or a new resource cluster is then determined in the same way as in the case where the neighbor files all belong to the same first target resource cluster (with the candidate cluster taken as the first target resource cluster).
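The decision for neighbors that span several clusters can be sketched as below. Several points are assumptions of this sketch rather than statements of the text: Euclidean distance, the function name `decide` and its return values, treating "no pair of clusters with a small proportion difference" the same as case b), and using 0.0 as the average pairwise distance when a cluster contributes only one neighbor (which makes condition a) fail for that cluster).

```python
import math
from collections import defaultdict
from itertools import combinations

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def decide(new_vector, neighbors, f_threshold, d_threshold):
    """Decide how to place a new resource whose neighbors span several clusters.

    neighbors is a list of (vector, cluster_id) pairs. Returns
    ("merge", lower, higher) when condition a) holds for a pair of clusters,
    where lower/higher refer to the proportion of neighbors, and
    ("fallback",) when the nearest-centroid check of case b) should be used.
    """
    groups = defaultdict(list)
    for vector, cluster_id in neighbors:
        groups[cluster_id].append(vector)
    share = {c: len(vs) / len(neighbors) for c, vs in groups.items()}

    for i, j in combinations(sorted(groups), 2):
        if abs(share[i] - share[j]) >= f_threshold:
            continue  # proportions are not comparable for this pair
        d_i_in = mean(math.dist(new_vector, v) for v in groups[i])
        d_j_in = mean(math.dist(new_vector, v) for v in groups[j])
        d_i_mean = mean(math.dist(a, b) for a, b in combinations(groups[i], 2))
        d_j_mean = mean(math.dist(a, b) for a, b in combinations(groups[j], 2))
        if (abs(d_i_in - d_j_in) < d_threshold
                and d_i_in < d_i_mean and d_j_in < d_j_mean):
            lower, higher = sorted((i, j), key=lambda c: share[c])
            return ("merge", lower, higher)
    return ("fallback",)

# Hypothetical case 1: the new vector sits between two equally represented clusters.
between = decide(
    (0, 0),
    [((-1, 2), 0), ((-1, -2), 0), ((1, 2), 1), ((1, -2), 1)],
    f_threshold=0.2, d_threshold=0.5,
)
# Hypothetical case 2: one cluster clearly dominates the neighbors.
dominated = decide(
    (1, 1),
    [((0, 0), 0), ((0, 0.2), 0), ((0.2, 0), 0), ((5, 5), 1)],
    f_threshold=0.2, d_threshold=0.5,
)
```

On a "fallback" result, the caller would pick the neighbor cluster with the nearest centroid and apply the single-cluster rule (dist_c - dist_max compared with dist_mean) to it.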
As an alternative embodiment, each resource file has a corresponding file level and each resource cluster has a corresponding theme, and extracting resource files from at least one resource cluster to form a file package according to a received resource file extraction request, and returning the file package, includes: extracting request information from the resource file extraction request, wherein the request information comprises at least one of the following: an extraction theme, an extraction file level, and an extraction quantity corresponding to each file type; in a case that the resource file extraction request includes the extraction theme and the extraction file level, screening out resource files that meet the extraction file level from the resource clusters whose theme is the same as the extraction theme; and in a case that the resource file extraction request includes the extraction quantity, randomly extracting the extraction quantity of resource files from the resource files that meet the extraction file level to form a file package, and returning the file package.
It should be noted that each resource cluster contains resources of different types grouped together according to content similarity, so that when a file package is generated, resource files can be selected preferentially from a single resource cluster to ensure that the different resource files in the package are related to one another.
In the embodiment of generating a lesson preparation package for a teacher in the field of education, the extraction theme can be keywords of questions, courseware, and videos, and the extraction file level can be the difficulty of the questions. For example, a teacher inputs request information for generating a lesson preparation package, including keywords for the questions, courseware, and videos, the number of files of each type, and the question difficulty level. A plurality of related resource clusters are returned according to the keywords and the file quantities; for each resource cluster, the questions that do not meet the difficulty requirement are filtered out, and then the specified number of resource files are randomly selected from the resource cluster to form the lesson preparation package.
As an optional embodiment, in a case that the resource file extraction request does not include the extraction quantity, the extraction quantity is determined according to the historical extraction behavior of the requester, or the resource files are extracted according to a preset extraction quantity; in a case that the resource file extraction request does not include the extraction theme, resource files are extracted from the first N resource clusters when the clusters are sorted by number of resource files from high to low; and in a case that the resource file extraction request does not include the extraction file level, the extraction quantity of resource files are randomly extracted from the resource clusters whose theme is the same as the extraction theme to form a file package.
It should be noted that the request information may include only one or two of the extraction theme, the extraction file level, and the extraction quantity corresponding to each file type. For example, when a teacher inputs a request for generating a lesson preparation package and provides only the keywords and the question difficulty level, the number of resource files in the lesson preparation package may be determined according to the numbers in the teacher's past lesson preparation package requests, or according to the preset extraction quantity. In an alternative embodiment, the keywords input by the teacher are matched against the keywords of the resource clusters, and the m resource clusters with the most relevant content are returned, where m is the preset number of returned resource clusters. If the teacher does not input keywords, the resource clusters can be sorted by popularity and the top m resource clusters returned. For each resource cluster, the questions that do not meet the difficulty requirement are filtered out, and the specified number of questions, courseware, and videos are randomly selected to generate the lesson preparation package.
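The extraction logic can be sketched as follows. The data layout, the function name `build_package`, exact-match comparison of themes and levels, and letting files without a level (e.g., courseware and videos) pass the level filter are all assumptions for illustration.

```python
import random

def build_package(clusters, counts, theme=None, level=None, top_n=1, seed=None):
    """Assemble a file package from clustered resources.

    clusters: list of dicts {"theme": str, "files": [file, ...]}, where each
    file is a dict {"id": ..., "type": ..., "level": ...}.
    counts: number of files wanted per file type, e.g. {"question": 2}.
    """
    rng = random.Random(seed)
    if theme is not None:
        selected = [c for c in clusters if c["theme"] == theme]
    else:
        # No theme given: use the top_n clusters with the most resource files.
        selected = sorted(clusters, key=lambda c: len(c["files"]), reverse=True)[:top_n]
    package = []
    for cluster in selected:
        for file_type, wanted in counts.items():
            # Keep files of the requested type that satisfy the level filter.
            pool = [
                f for f in cluster["files"]
                if f["type"] == file_type
                and (level is None or f["level"] is None or f["level"] == level)
            ]
            package.extend(rng.sample(pool, min(wanted, len(pool))))
    return package

# Hypothetical resource clusters.
clusters = [
    {"theme": "intersecting lines", "files": [
        {"id": "q1", "type": "question", "level": "easy"},
        {"id": "q2", "type": "question", "level": "hard"},
        {"id": "q3", "type": "question", "level": "easy"},
        {"id": "c1", "type": "courseware", "level": None},
        {"id": "v1", "type": "video", "level": None},
    ]},
    {"theme": "parallel lines", "files": [
        {"id": "q4", "type": "question", "level": "easy"},
    ]},
]
package = build_package(
    clusters, {"question": 2, "courseware": 1, "video": 1},
    theme="intersecting lines", level="easy", seed=0,
)
no_theme = build_package(clusters, {"question": 1}, seed=0)
```

When fewer matching files exist than requested, this sketch simply returns all of them; how a real system should handle such shortfalls is not specified in the text.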
Fig. 2 is a flowchart of an alternative file processing method according to an embodiment of the present invention, and as shown in fig. 2, the method includes:
Step S201, acquiring resource information; the resources processed include, but are not limited to, questions, courseware, and videos. The required question information includes: question text content (stem, options, answer), difficulty, and associated chapter. The required courseware information includes: courseware content (including the text of each page) and associated chapter. The required video information includes: video content (including the subtitles or audio of each frame) and associated chapter.
Step S202, resource characteristics are constructed; and vectorizing the characteristics of the resources, particularly the text characteristics.
Step S203, clustering resources; after the vector representation of each resource is obtained, the vector can be used for clustering, for example, when the clustering range is set to be within a chapter, the resources under the same chapter can be clustered according to the chapter label of the resource, and the resources with similar contents are clustered together.
Step S204, a lesson preparation package is generated.
Fig. 3 is a flowchart of an optional method for constructing feature information of a resource file according to an embodiment of the present invention, and as shown in fig. 3, the method includes:
Step S301, updating the stop word list; taking courseware as an example, the text of the full set of courseware in the resource library is extracted and segmented into words; after cleaning with the general stop word list, the frequency of each remaining word is counted, and new stop words are screened from the higher-frequency words to expand the stop word list for educational resources.
Step S302, word segmentation and stop word removal; each resource file is segmented into words and the stop words are then removed.
Step S303, vectorizing the text; the word units remaining after stop word removal are vectorized using TF-IDF or Word2vec.
Step S304, compressing the vector; element-wise scaling is performed with an activation function so that each dimension of the text vector is limited to the range 0 to 1.
Through the above steps, a stop word list applicable to questions, courseware, and videos in the education field is constructed, and the features of the various resources are extracted in the same space so that resources of different types can be represented uniformly. In this embodiment, clustering is used to find strongly correlated resources, so that combinations of resources that are similar in content and suitable for use together are found more efficiently. For the large number of new resources generated every day, an incremental clustering method assigns each new resource to a suitable resource cluster. This helps teachers quickly build lesson preparation packages that meet their requirements, shortens the time teachers spend searching for resources and matching resources of different types, and improves teaching efficiency.
Example 2
According to an embodiment of the present application, an embodiment of a file processing apparatus is provided. Fig. 4 is a schematic diagram of a file processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the apparatus includes: an obtaining module 41, configured to obtain a plurality of resource files and construct feature information of each resource file; a clustering module 42, configured to cluster the plurality of resource files based on the feature information of each resource file, so as to generate a plurality of resource clusters; and a forming module 43, configured to extract resource files from at least one resource cluster to form a file package according to a received resource file extraction request, and return the file package.
As an optional embodiment, the obtaining module includes: a first word segmentation sub-module, configured to acquire text information in the resource file and perform word segmentation on the text information; a cleaning sub-module, configured to clean the word segmentation result using the stop word list; and a vectorization sub-module, configured to perform text vectorization processing based on the cleaning result to obtain a text vector for representing the feature information.
As an optional embodiment, in the case that the resource file is a video file, the obtaining module includes: a subtitle extraction sub-module, configured to acquire the subtitle data in the case that the video file includes subtitle data, to obtain the text information in the video file; and a voice conversion sub-module, configured to extract the voice information in the video file and convert the voice information into text information in the case that the video file does not include subtitle data.
As an alternative embodiment, the apparatus further comprises: a stop word list creating sub-module, configured to create a stop word list corresponding to the file type of the resource file, wherein the stop word list creating sub-module comprises: a second word segmentation sub-module, configured to perform word segmentation on the full set of resource files in a resource library, wherein the resource library comprises multiple types of resource files; a screening sub-module, configured to screen out stop words corresponding to each type of resource file from the word segmentation results of the full set of resource files, wherein the stop words corresponding to each type of resource file are determined according to the occurrence frequency of each stop word in that type of resource file; and a stop word list generating sub-module, configured to generate the stop word list corresponding to each file type according to the stop words corresponding to that type of resource file.
As an optional embodiment, the cleaning sub-module is further configured to clean the word segmentation result through a stop word table corresponding to the file type of the resource file.
As an alternative embodiment, the apparatus further comprises one or more of the following: the scaling submodule is used for scaling the text vector through an activation function; and the dimension reduction submodule is used for carrying out dimension reduction processing on the text vector.
As an optional embodiment, the clustering module is further configured to cluster the plurality of resource files based on the feature information of each resource file through a K-means clustering algorithm to generate a plurality of resource clusters.
As an optional embodiment, the apparatus further includes: the first newly added module is used for receiving the newly added resource file and constructing the characteristic information of the newly added resource file; the neighbor determining submodule is used for determining neighbor files of the newly added resource files according to the characteristic information of the newly added resource files and the characteristic information of the existing resource files; and the first dividing module is used for dividing the newly added resource into any one of the resource clusters or regenerating a resource cluster for the newly added resource according to the distance relationship between the newly added resource file and the adjacent file.
As an alternative embodiment, in a case that the neighbor files all belong to the same first target resource cluster, the first dividing module further includes: a first distance obtaining sub-module, configured to obtain a first distance between the newly added resource file and the centroid of the first target resource cluster; a second distance obtaining sub-module, configured to obtain a second distance between the centroid and the resource file in the first target resource cluster that is farthest from the centroid; an average distance obtaining sub-module, configured to obtain the average distance between all resource files in the first target resource cluster and the centroid; a second dividing sub-module, configured to divide the newly added resource into the first target resource cluster in a case that the difference between the first distance and the second distance is smaller than or equal to the average distance; and a second newly-added sub-module, configured to regenerate a resource cluster for the newly added resource in a case that the difference between the first distance and the second distance is greater than the average distance.
As an alternative embodiment, in a case that the neighbor files do not all belong to the same resource cluster, the first dividing module further includes: a proportion obtaining sub-module, configured to obtain the proportion of the neighbor files that belongs to each resource cluster; an average value obtaining sub-module, configured to obtain, in a case that there exist a second target resource cluster and a third target resource cluster whose proportion difference is smaller than a first preset value, a first average value of the distances between the newly added resource and the neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource and the neighbor files belonging to the third target resource cluster, a third average value of the distances between the neighbor files in the second target resource cluster, and a fourth average value of the distances between the neighbor files in the third target resource cluster; a first adding sub-module, configured to add the newly added resource file and the neighbor files in the second target resource cluster into the third target resource cluster if the first average value, the second average value, the third average value and the fourth average value meet preset conditions, wherein the proportion of the third target resource cluster is higher than that of the second target resource cluster, and the preset conditions include: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value; and a second adding sub-module, configured to, if the first average value, the second average value, the third average value and the fourth average value do not meet the preset conditions, acquire the resource cluster to which the centroid with the shortest distance to the newly added resource file belongs, and add the newly added resource file into the determined resource cluster or regenerate a resource cluster for the newly added resource.
As an alternative embodiment, each resource file has a corresponding file level and each resource cluster has a corresponding theme, and the forming module includes: an extracting sub-module, configured to extract request information from the resource file extraction request, wherein the request information comprises at least one of the following: an extraction theme, an extraction file level, and an extraction quantity corresponding to each file type; a first selection sub-module, configured to screen out resource files that meet the extraction file level from the resource clusters whose theme is the same as the extraction theme in a case that the resource file extraction request includes the extraction theme and the extraction file level; and a second selection sub-module, configured to randomly extract the extraction quantity of resource files from the resource files that meet the extraction file level to form a file package in a case that the resource file extraction request includes the extraction quantity, and return the file package.
As an alternative embodiment, the forming module further includes: a third selection sub-module, configured to determine the extraction quantity according to the historical extraction behavior of the requester, or to extract the resource files according to a preset extraction quantity, in a case that the resource file extraction request does not include the extraction quantity; and a fourth selection sub-module, configured to extract resource files from the first N resource clusters, when the clusters are sorted by number of resource files from high to low, in a case that the resource file extraction request does not include the extraction theme; and, in a case that the resource file extraction request does not include the extraction file level, the extraction quantity of resource files are randomly extracted from the resource clusters whose theme is the same as the extraction theme to form a file package.
In this embodiment, a file package is formed by constructing feature information of a resource file, clustering a plurality of resource files to obtain a resource cluster, and obtaining a combination of a plurality of resources from the resource cluster according to a resource file extraction request. The resource file combination method can be used for generating lesson preparation packages for teaching, multi-resource combinations which are similar in content and suitable for matched use can be generated by constructing text features of resource files in the education field and correspondingly clustering, teachers are helped to quickly establish lesson preparation packages meeting requirements, time for teachers to search resources and match different types of resources is reduced, the technical problem that teachers cannot accurately find suitable teaching resources due to a single teaching resource recommendation method in the prior art is solved, and teaching efficiency is improved.
Example 3
According to an embodiment of the application, an embodiment of a computer storage medium is provided, the computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any of the above. The method comprises the steps of obtaining a resource cluster by constructing characteristic information of resource files and clustering a plurality of resource files, and obtaining a combination of a plurality of resources from the resource cluster to form a file package according to a resource file extraction request. The resource file combination method can be used for generating lesson preparation packages for teaching, multi-resource combinations which are similar in content and suitable for matched use can be generated by constructing text features of resource files in the education field and correspondingly clustering, teachers are helped to quickly establish lesson preparation packages meeting requirements, time for teachers to search resources and match different types of resources is reduced, the technical problem that teachers cannot accurately find suitable teaching resources due to a single teaching resource recommendation method in the prior art is solved, and teaching efficiency is improved.
Example 4
According to an embodiment of the present application, there is provided an intelligent interactive tablet, including: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform any of the method steps of embodiment 1.
Fig. 5 is a schematic structural diagram of an intelligent interactive tablet according to an embodiment of the present application. The intelligent interactive tablet includes an interaction device main body and a touch frame. As shown in Fig. 5, the intelligent interactive tablet 1000 may include: at least one processor 1001, at least one network interface 1004, a user interface 1003, a memory 1005, and at least one communication bus 1002.
The communication bus 1002 is used to enable connection and communication between these components.
The user interface 1003 may include a display screen (Display) and a camera (Camera); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface).
The processor 1001 may include one or more processing cores. The processor 1001 connects the various parts of the intelligent interactive tablet 1000 using various interfaces and lines, and performs the various functions of the intelligent interactive tablet 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1005 and by invoking data stored in the memory 1005. Optionally, the processor 1001 may be implemented in at least one hardware form of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 1001 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen; the modem is used to handle wireless communications. It is understood that the modem may also not be integrated into the processor 1001 and may instead be implemented by a separate chip.
The memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable storage medium. The memory 1005 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1005 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the method embodiments described above, and the like; the data storage area may store the data involved in the method embodiments described above. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Fig. 5, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an operating application of the intelligent interactive tablet.
In the intelligent interactive tablet 1000 shown in Fig. 5, the user interface 1003 is mainly used to provide an input interface for a user and to obtain data input by the user; the processor 1001 may be configured to call the operating application of the intelligent interactive tablet stored in the memory 1005 and to perform any of the operations of embodiment 1.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (15)

1. A method for processing a file, comprising:
acquiring a plurality of resource files and constructing characteristic information of each resource file;
clustering the resource files based on the characteristic information of each resource file to generate a plurality of resource clusters;
and extracting the resource files from at least one resource cluster to form a file package according to the received resource file extraction request, and returning the file package.
2. The method of claim 1, wherein obtaining a plurality of resource files and constructing characteristic information for each of the resource files comprises:
acquiring text information in the resource file, and segmenting the text information;
cleaning the word segmentation result by using the stop word list;
and performing text vectorization processing based on the cleaning result to obtain a text vector for representing the feature information.
3. The method according to claim 2, wherein in the case that the resource file is a video file, acquiring text information in the resource file comprises:
under the condition that the video file comprises subtitle data, acquiring the subtitle data to obtain text information in the video file;
and under the condition that the video file does not comprise the subtitle data, extracting the voice information in the video file, and converting the voice information into text information.
4. The method of claim 2, further comprising: creating a stop word list corresponding to the file type of the resource file, wherein creating the stop word list corresponding to the file type of the resource file comprises:
performing word segmentation on the full set of resource files in a resource library, wherein the resource library comprises a plurality of types of resource files;
screening out stop words corresponding to each type of resource files from the word segmentation results of the full set of resource files, wherein the stop words corresponding to each type of resource files are determined according to the occurrence frequency of each stop word in each type of resource files;
and generating the stop word list corresponding to the file type according to the stop words corresponding to each type of resource file.
5. The method of claim 4, wherein cleaning the word segmentation result by using the stop word list comprises: cleaning the word segmentation result by using the stop word list corresponding to the file type of the resource file.
6. The method of claim 2, wherein after performing text vectorization processing based on the cleaning result to obtain a text vector representing the feature information, the method further comprises one or more of:
performing scaling processing on the text vector through an activation function;
and performing dimension reduction processing on the text vector.
7. The method of claim 1, wherein clustering the plurality of resource files based on the characteristic information of each of the resource files to generate a plurality of resource clusters comprises:
and clustering the resource files based on the characteristic information of each resource file through a K-means clustering algorithm to generate the resource clusters.
8. The method of claim 1, wherein after clustering the plurality of resource files based on the characteristic information of each of the resource files to generate a plurality of resource clusters, the method further comprises:
receiving a newly added resource file and constructing the characteristic information of the newly added resource file;
determining a neighbor file of the newly added resource file according to the characteristic information of the newly added resource file and the characteristic information of the existing resource file;
and dividing the newly added resource into any one of the resource clusters or regenerating a resource cluster for the newly added resource according to the distance relationship between the newly added resource file and the adjacent file.
9. The method according to claim 8, wherein, in a case that the neighboring files all belong to the same first target resource cluster, dividing the newly added resource into any one of the resource clusters or regenerating a resource cluster for the newly added resource according to a distance relationship between the newly added resource file and the neighboring files comprises:
acquiring a first distance between the newly added resource file and the centroid of the first target resource cluster;
acquiring a second distance between a resource file which is farthest from the centroid in the first target resource cluster and the centroid;
acquiring the average distance between all resource files in the first target resource cluster and the centroid;
under the condition that the difference between the first distance and the second distance is smaller than or equal to the average distance, dividing the newly added resource into the first target resource cluster;
and under the condition that the difference between the first distance and the second distance is larger than the average distance, regenerating a resource cluster for the newly added resource.
10. The method according to claim 8, wherein in a case that the neighboring files do not belong to the same resource cluster, dividing the newly added resource into any one of the resource clusters or regenerating a resource cluster for the newly added resource according to a distance relationship between the newly added resource file and the neighboring files comprises:
acquiring the occupation ratio of the resource cluster to which the neighbor file belongs;
under the condition that a second target resource cluster and a third target resource cluster exist, the proportion difference of which is smaller than a first preset value, acquiring a first average value of the distances between the newly added resource and neighbor files belonging to the second target resource cluster, a second average value of the distances between the newly added resource and neighbor files belonging to the third target resource cluster, a third average value of the distances between neighbor files in the second target resource cluster and a fourth average value of the distances between neighbor files in the third target resource cluster;
if the first average value, the second average value, the third average value and the fourth average value meet preset conditions, adding the newly added resource file and a neighbor file in the second target resource cluster into a third target resource cluster, wherein the occupation ratio of the third target resource cluster is higher than that of the second target resource cluster, and the preset conditions include: the absolute value of the difference between the first average value and the second average value is smaller than a second preset value, the first average value is smaller than the third average value, and the second average value is smaller than the fourth average value;
and if the first average value, the second average value, the third average value and the fourth average value do not meet preset conditions, acquiring a resource cluster to which a centroid with the shortest distance to the newly added resource file belongs, and adding the newly added resource file into the determined resource cluster, or regenerating a resource cluster for the newly added resource.
11. The method of claim 1, wherein each resource file has a corresponding file grade, each resource cluster has a corresponding topic, and extracting resource files from at least one of the resource clusters to form a file package and returning the file package according to the received resource file extraction request comprises:
acquiring request information in the resource file extraction request, wherein the request information comprises at least one of the following items: an extraction topic, an extraction file grade, and an extraction quantity corresponding to each file type;
under the condition that the resource file extraction request comprises the extraction topic and the extraction file grade, screening out resource files meeting the extraction file grade from the resource cluster whose topic is the same as the extraction topic;
and under the condition that the resource file extraction request comprises the extraction quantity, randomly extracting resource files corresponding to the extraction quantity from the resource files meeting the extraction file grade to form the file package, and returning the file package.
12. The method of claim 11, further comprising:
under the condition that the resource file extraction request does not include the extraction quantity, determining the extraction quantity according to the historical extraction behavior of the extracting party, or extracting the resource files according to a preset extraction quantity;
under the condition that the resource file extraction request does not include the extraction topic, extracting resource files from the first N resource clusters when the resource clusters are sorted by number of resource files from high to low;
and under the condition that the resource file extraction request does not include the extraction file grade, randomly extracting resource files corresponding to the extraction quantity from the resource cluster whose topic is the same as the extraction topic to form the file package.
13. A device for processing a file, comprising:
the acquisition module is used for acquiring a plurality of resource files and constructing the characteristic information of each resource file;
the clustering module is used for clustering the resource files based on the characteristic information of each resource file to generate a plurality of resource clusters;
and the composition module is used for extracting the resource files from at least one resource cluster to form a file package according to the received resource file extraction request and returning the file package.
14. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any of claims 1 to 12.
15. An intelligent interactive tablet, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 12.
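The single-cluster rule recited in claim 9 compares three distances: the first distance from the newly added resource file to the cluster centroid, the second distance from the farthest existing member to the centroid, and the average distance of all members to the centroid. A minimal Python sketch of that comparison follows. The function name, the return labels "join" and "new", the use of Euclidean distance, and the representation of files as plain coordinate lists are assumptions for illustration only.

```python
import math


def assign_or_new_cluster(new_vec, cluster_vecs):
    """Return "join" if the new file fits the cluster, otherwise "new"."""
    # Centroid: per-dimension mean of the existing members.
    centroid = [sum(col) / len(cluster_vecs) for col in zip(*cluster_vecs)]
    member_dists = [math.dist(v, centroid) for v in cluster_vecs]

    first = math.dist(new_vec, centroid)            # new file to centroid
    second = max(member_dists)                      # farthest member to centroid
    average = sum(member_dists) / len(member_dists) # mean member distance

    # Join when the new file lies at most one average distance beyond the
    # farthest member; otherwise start a new cluster for it.
    return "join" if first - second <= average else "new"


cluster = [[0.0, 0.0], [2.0, 0.0]]  # centroid (1, 0), all members at distance 1
print(assign_or_new_cluster([2.5, 0.0], cluster))   # join  (1.5 - 1 <= 1)
print(assign_or_new_cluster([10.0, 0.0], cluster))  # new   (9 - 1 > 1)
```

Using the cluster's own average distance as the tolerance means a tight cluster accepts only nearby files, while a loosely spread cluster tolerates files further from its edge.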
CN202011602808.9A 2020-12-29 2020-12-29 File processing method and device Active CN112732867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011602808.9A CN112732867B (en) 2020-12-29 2020-12-29 File processing method and device

Publications (2)

Publication Number Publication Date
CN112732867A true CN112732867A (en) 2021-04-30
CN112732867B CN112732867B (en) 2024-03-15

Family

ID=75610513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011602808.9A Active CN112732867B (en) 2020-12-29 2020-12-29 File processing method and device

Country Status (1)

Country Link
CN (1) CN112732867B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system
CN108647244A (en) * 2018-04-13 2018-10-12 广东技术师范学院 The tutorial resources integration method of mind map form, network store system
CN109299315A (en) * 2018-09-03 2019-02-01 腾讯科技(深圳)有限公司 Multimedia resource classification method, device, computer equipment and storage medium
CN110929161A (en) * 2019-12-02 2020-03-27 南京莱斯网信技术研究院有限公司 Large-scale user-oriented personalized teaching resource recommendation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
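Li Mengxiong et al.: "Research on Adaptive Recommendation of Teaching Resources Based on Fuzzy Clustering", Distance Education in China (中国远程教育), no. 7, pages 89 - 92 *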

Also Published As

Publication number Publication date
CN112732867B (en) 2024-03-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant