CN116704508A - Information processing method and device - Google Patents

Information processing method and device

Info

Publication number
CN116704508A
CN116704508A (application CN202310614640.0A)
Authority
CN
China
Prior art keywords
image
text
key information
content
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310614640.0A
Other languages
Chinese (zh)
Inventor
薛云飞
原鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinhuasan Intelligent Terminal Co ltd
Original Assignee
Xinhuasan Intelligent Terminal Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinhuasan Intelligent Terminal Co ltd filed Critical Xinhuasan Intelligent Terminal Co ltd
Priority to CN202310614640.0A priority Critical patent/CN116704508A/en
Publication of CN116704508A publication Critical patent/CN116704508A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 Electrically-operated educational appliances
    • G09B5/02 Electrically-operated educational appliances with visual presentation of the material to be studied, e.g. using film strip
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Educational Technology (AREA)
  • Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Educational Administration (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides an information processing method and device, wherein the method is applied to a terminal and comprises the following steps: acquiring a first image included in teaching software; if the first image comprises formula content, filling the formula content in the first image into a second image to obtain a third image; inputting the third image into an OCR model to obtain text contents included in the third image; inputting the text content into a NER model to obtain key information included in the text content and an image position of the key information in the third image; and highlighting the image position in the third image.

Description

Information processing method and device
Technical Field
The present application relates to the field of communications technologies, and in particular, to an information processing method and apparatus.
Background
With the continuous development of technology, digital education has become an important part of the education field, and more and more teachers give lessons with the help of teaching software (e.g., PPT). Because teaching software generally contains a large amount of text, teachers often encounter difficulties when explaining its content, such as complex content, fixed formats, missing key content, and inefficient interaction. Traditional digital education also presents problems for students; for example, text-heavy teaching materials tend to distract students, and the learning effect is often not ideal.
In the classroom, different teachers use different teaching software according to their own habits. If text information is extracted through the teaching software's own interfaces, problems such as non-uniform interfaces, incomplete coverage, and difficult integration may arise. Therefore, how to extract and highlight the key information in teaching software through digital means, and comprehensively explain the key content and keywords involved in a class, has become an important problem in digital education.
In the existing process of extracting and displaying keywords from teaching software, a terminal first recognizes the text characters in the picture to be detected to obtain a recognition result; the terminal then matches the target keyword characters against the recognition result. If a target keyword is matched, the terminal determines the target box in which the keyword is located and derives the keyword's coordinates within that box by rules; finally, the terminal highlights the keyword.
However, this keyword positioning and display process exposes the following defects: 1) in digital education scenarios, a large number of formulas may be embedded in teaching software, and these formulas interfere with locating the keyword coordinates, leading to positioning errors; 2) after multi-line detection results are obtained, recognizing multiple lines of text multiplies the time cost; 3) when the same keyword appears in several lines of text, all recognition results must be traversed, and a large number of nested words brings additional time overhead; 4) text characters may be recognized incorrectly, and if the target keyword characters cannot be matched, the analysis of the keywords is affected.
Disclosure of Invention
In view of the above, the present application provides an information processing method and apparatus to solve the problems in the existing keyword positioning and display process. The method is compatible with teaching scenarios that use different teaching software, improves the extensibility and readability of teaching courseware, helps students grasp important content, and improves the interactive experience of digital teaching.
In a first aspect, the present application provides an information processing method, where the method is applied to a terminal, the method includes:
acquiring a first image included in teaching software;
if the first image comprises formula content, filling the formula content in the first image into a second image to obtain a third image;
inputting the third image into an OCR model to obtain text contents included in the third image;
inputting the text content into a NER model to obtain key information included in the text content and an image position of the key information in the third image;
and highlighting the image position in the third image.
In a second aspect, the present application provides an information processing apparatus, the apparatus being applied to a terminal, the apparatus comprising:
An acquisition unit configured to acquire a first image included in teaching software;
the filling unit is used for filling the formula content in the first image into a second image to obtain a third image if the formula content is included in the first image;
the first processing unit is used for inputting the third image into an OCR model to obtain text contents included in the third image;
the second processing unit is used for inputting the text content into the NER model to obtain key information included in the text content and an image position of the key information in the third image;
and the display unit is used for highlighting the image position in the third image.
In a third aspect, the application provides a network device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor to cause the processor to perform the method provided by the first aspect of the application.
Therefore, by applying the information processing method and the information processing device provided by the application, the terminal acquires the first image included in the teaching software; if the first image comprises formula content, the terminal fills the formula content in the first image into a second image to obtain a third image; the terminal inputs the third image into the OCR model to obtain text content included in the third image; the terminal inputs the text content into the NER model to obtain key information included in the text content and an image position of the key information in a third image; the terminal highlights the image position in the third image.
In this way, when the teaching software contains formula content in a digital education scenario, covering the formula content and extracting the key information and its position from the image in which the formula content has been covered improves the extensibility and readability of teaching courseware; meanwhile, highlighting the key information helps students grasp the key content of the class and improves the interactive experience of digital teaching. The method solves several problems in the existing keyword positioning and display process. During teaching, the terminal is also compatible with different teaching software, solving the problems that different teaching software provides different interfaces for obtaining teaching information, offers incomplete coverage, and is complex to invoke.
Drawings
FIG. 1 is a flowchart of an information processing method according to an embodiment of the present application;
FIG. 2 is a block diagram of an information processing apparatus according to an embodiment of the present application;
FIG. 3 shows the hardware structure of a network device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the corresponding listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
The information processing method provided by the embodiment of the application is described in detail below. Referring to FIG. 1, FIG. 1 is a flowchart of an information processing method according to an embodiment of the present application. The method is applied to a terminal, which may be embodied as a cloud screen (a device with a display function, a screenshot function, processing capability for running various models and algorithms, input/output functions, etc.). The information processing method provided by the embodiment of the application may include the following steps.
Step 110, acquiring a first image included in teaching software;
specifically, in the teaching process, a teacher uses a terminal to teach knowledge content to students. The terminal displays teaching software used by the teacher, for example, a certain page in the PPT is currently displayed.
The terminal intercepts the current display PPT as a first image by utilizing the self screenshot function. It can be understood that when the terminal intercepts a certain page in the PPT, the currently displayed page should be completely intercepted as the first image. The terminal may store the first image in a buffer of the terminal.
Step 120, if the first image includes formula content, filling the formula content in the first image into a second image to obtain a third image;
specifically, according to the description of step 110, after the terminal acquires the first image, it identifies whether the first image includes formula content or not through a trained formula detection algorithm.
In one implementation, if the first image includes formula content, the terminal fills the formula content in the first image into the second image, and obtains the third image.
The formula detection algorithm locates the formula content included in the image. The formula content is carried by a formula detection box, which may be a separate region or a nested region: if the formula detection box is displayed in the same line as text characters, it belongs to a nested region; if the formula detection box occupies a line by itself, it belongs to a separate region.
The terminal fills the formula content with the second image, which is a white image; that is, the terminal fills the formula detection box with white. The terminal may fill the formula detection box with a white image in a variety of ways, for example by overlaying the formula detection box with a white image, or by adjusting the pixel values of the pixels inside the formula detection box.
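For illustration, the following is a minimal sketch of this filling step, assuming the formula detection boxes are given as (x1, y1, x2, y2) pixel coordinates and the image is a NumPy array; the function and parameter names are hypothetical and not taken from the patent.

```python
import numpy as np

def mask_formula_regions(first_image: np.ndarray, formula_boxes) -> np.ndarray:
    """Fill each formula detection box with white, producing the third image."""
    third_image = first_image.copy()
    for (x1, y1, x2, y2) in formula_boxes:   # assumed (left, top, right, bottom) in pixels
        third_image[y1:y2, x1:x2] = 255      # overwrite the region with white pixels
    return third_image
```

Covering the regions in place keeps the remaining text at its original coordinates, which matters for the position calculations in step 140.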
Step 130, inputting the third image into an optical character recognition (OCR) model to obtain text content included in the third image;
specifically, according to the description of step 120, after the terminal obtains the third image (i.e., the formula content has been filled in as the second image), the terminal inputs the third image into the OCR model. And obtaining the text content included in the third image by the terminal through the OCR model.
Optionally, the terminal inputs the third image into the OCR model to obtain text content included in the third image, which specifically includes:
the terminal carries out binarization operation on the image, and the image after the binarization operation is input into a text detection network (English: differentiable Binarization Network, DBNet for short) algorithm based on differential binarization. The positions of the plurality of text detection boxes and the first confidence of each text detection box are output through a DBNet algorithm. A first number of confidences is selected from the first plurality of confidences. The first text detection boxes corresponding to the first number of confidence coefficients and the positions of the first text detection boxes are input into a convolutional recurrent neural network (English: convolutional Recurrent Neural Network, abbreviated as CRNN) algorithm, the coded text identification corresponding to each first text detection box and the second confidence coefficient of the coded text identification are output, and each group of coded text identification is used for representing the text content in the first text detection box. A second number of confidences is selected from the plurality of second confidences. And decoding the text identifiers with the second number of confidence coefficients by using the vocabulary file to obtain text contents included in the third image.
Further, the binarization operation maps the RGB channel values of each pixel in the image to 0 or 1: a pixel in a white region is assigned 1, and a pixel in a black region is assigned 0. The purpose of the binarization operation is to improve the separability of the text, i.e., so that the black regions in the image are all text content.
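A minimal sketch of the binarization step using OpenCV follows; the patent does not specify the thresholding method, so Otsu's method here is an assumption.

```python
import cv2

def binarize(image_bgr):
    """Map each pixel to 0 (black, treated as text) or 1 (white, background)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Otsu's method chooses the threshold automatically; the text only requires
    # that text regions end up black (0) and the background white (1).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return (binary // 255).astype("uint8")
```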
The DBNet algorithm incorporates a lightweight backbone feature extraction network (MobileNetV3, a lightweight neural network designed for mobile and edge computing devices, mainly used for tasks such as image classification, object detection, and face recognition). It processes the binarized image and outputs the positions of a plurality of text detection boxes and a first confidence for each text detection box. In an object detection task, confidence is the algorithm's assessment of how certain it is about each detected box, typically expressed as a probability between 0 and 1, where values near 1 indicate high confidence that the target is contained in the box and values near 0 indicate low confidence.
The first number of confidences may specifically be the confidences exceeding a first confidence threshold. For example, the DBNet algorithm outputs 10 text detection boxes and 10 first confidences. If the first confidence threshold is 0.3, the confidences exceeding 0.3 are selected from the 10 first confidences; for example, 7 of them exceed 0.3.
The CRNN algorithm integrates a lightweight feature extraction network that extracts the character features of each line of text in the first text detection box. The lightweight feature extraction network improves on the original MobileNetV3 network of Baidu PaddlePaddle OCR (PaddleOCR, an open-source OCR tool library based on deep learning technology, widely applied in scenarios such as document recognition, license plate recognition, and printed text recognition): it replaces the Bottleneck blocks with RepVGGBlock, which keeps the network lightweight while improving its ability to extract character features.
Bottleneck: also known as the bottleneck layer, it uses 1x1 convolutions and is so named because of its shape, with a narrower middle. The Bottleneck structure is widely used in deep learning model design; classical models such as ResNet and Inception adopt it to optimize performance. It combines several convolution layers and controls computational complexity and model capacity by adjusting the number of channels, so that the network processes input data more efficiently, improving both computation speed and prediction accuracy.
RepVGGBlock: repVGG is a novel convolutional neural network architecture that innovatively modifies traditional convolutional neural networks and achieves performance comparable to current most advanced models in multiple computer vision tasks. In RepVGG, repVGG Block is one of basic building blocks, and comprises a series of convolution layers and normalization layers (English: batch Normalization, abbreviated as BN), and meanwhile, the structural heavy parameterization idea is adopted to improve the performance of a network, so that the network can still keep good performance under the condition of larger depth.
Text features refer to characteristics or attributes of the text. In OCR models, the usual methods for extracting text features include the following:
1) Character shape features: extracted by analyzing shape information such as the outline and boundary of a character, for example its height, width, number of strokes, and curve shapes.
2) Character texture features: extracted by analyzing the texture information of a character, for example its gray-level distribution and texture direction; these can be used to distinguish texture differences between characters.
3) Character statistical features: extracted by counting pixel information within or around a character, for example its pixel density and its horizontal and vertical projection distributions.
4) Character spacing features: extracted by analyzing the spacing between characters; these can be used to identify spaces, connecting symbols, and so on.
5) Character projection features: extracted from the character's projections in the horizontal and vertical directions; for example, horizontal and vertical projection histograms can be computed to capture the character's projection pattern.
6) Character-specific features: some characters have unique attributes from which features can be extracted; for example, for numeric characters, the relative positional relationship between the upper half and the lower half may be considered.
The above text feature extraction methods may be combined with one another. In practical applications, suitable features can be selected according to the specific scenario and requirements; more advanced text features can also be extracted and learned using machine learning or deep learning.
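As an illustration of one of the feature types listed above (character projection features), the following sketch computes horizontal and vertical projection histograms over a binarized character crop; it is illustrative only and not the feature extractor used in the patent.

```python
import numpy as np

def projection_features(char_crop: np.ndarray):
    """Horizontal and vertical projection histograms of a binarized character
    crop (here 1 is assumed to mark text pixels), capturing its projection pattern."""
    horizontal = char_crop.sum(axis=1)   # amount of text per row
    vertical = char_crop.sum(axis=0)     # amount of text per column
    return horizontal, vertical
```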
Using an encoder-decoder structure, the CRNN algorithm encodes the character features and outputs the encoded character identifiers corresponding to each first text detection box, together with a second confidence for the encoded character identifiers. The encoded character identifiers are then decoded against a vocabulary file (one Chinese character or other character per line; the vocabulary file may also be an ASCII table) to output standard characters. In the decoding operation, the CRNN algorithm can capture the sequential information of the characters using a bidirectional Long Short-Term Memory (LSTM) network, and decodes the character identifiers using approaches such as Connectionist Temporal Classification (CTC) loss to output standard characters.
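For concreteness, here is a sketch of greedy CTC decoding (collapse repeated ids, drop the blank, then look each id up in the vocabulary file); the actual decoder used with CRNN/PaddleOCR may differ in details.

```python
def ctc_greedy_decode(token_ids, vocab, blank_id=0):
    """Collapse repeated ids, drop the blank symbol, and map the remaining
    ids to characters through the vocabulary file."""
    chars, prev = [], None
    for idx in token_ids:
        if idx != blank_id and idx != prev:
            chars.append(vocab[idx])
        prev = idx
    return "".join(chars)

# ctc_greedy_decode([3, 3, 0, 5, 5, 0, 7], ["-", "a", "b", "c", "d", "e", "f", "g"])
# returns "ceg" in this toy vocabulary ("-" plays the role of the blank).
```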
LSTM: the method is mainly applied to solving the problems of gradient elimination and gradient explosion existing in the traditional RNN, and can better process long sequence data. Compared to traditional recurrent neural networks, LSTM introduces three gates within the cell to control the flow of information, including an input gate, a forget gate, and an output gate. The gates control input signals in a mode of learning parameters through operation of an activation function, so that the LSTM can better store and utilize past states and information, and meanwhile, the current input information can still be effectively learned, and therefore the effect and training speed of the model are improved. LSTM has achieved good performance in the fields of speech recognition, natural language processing, etc.
CTC: in the text recognition task, the ratio of the number of missed characters to the total number of characters is determined when the model cannot correctly recognize all the characters due to the interval between the characters or the shape of the characters.
The second number of confidences may specifically be the confidences exceeding a second confidence threshold. For example, the CRNN algorithm outputs the encoded text identifiers corresponding to the 7 text detection boxes and 7 second confidences. If the second confidence threshold is 0.5, the confidences exceeding 0.5 are selected from the 7 second confidences; for example, 4 of them exceed 0.5.
Step 140, inputting the text content into a NER model to obtain key information included in the text content and an image position of the key information in the third image;
specifically, according to the description of step 130, after obtaining the text content included in the third image, the terminal inputs the text content into the NER model. And obtaining key information included in the text content and the image position of the key information in the third image by using a named entity recognition (English: named Entity Recognition, NER for short) model.
Optionally, the terminal inputs the text content into the NER model to obtain the key information included in the text content and the image position of the key information in the third image, which specifically includes:
the terminal inputs the text content into the NER model to obtain key information included in the text content and a first position of the key information in the image; according to the first position of the first text content in the first text detection box and the first position of the key information in the image, the terminal calculates the second position of the key information in the first text detection box; and calculating the image position by the terminal according to the position of the first text detection box, the box attribute of the first text detection box and the second position.
In embodiments of the present application, the NER model may be embodied as an entity pointer network, which extracts entity information, i.e., the key information. For example, suppose the text content recognized by the OCR model from the third image is: "rewrite the function's analytic expression as the sum of a squared term containing the independent variable and a constant, and then determine the value range or maximum value of the function according to the value range of the variable."
The entity pointer network outputs the key information in the image and a first position of the key information in the image (which may also be referred to as the absolute position of the key information in the image). The first position includes a start position and an end position. For example, after the above text content is input, the result is: {'mathematical term': [['function analytic expression', 1, 5], ['independent variable', 10, 12], ['squared term', 14, 16], ['constant', 18, 19], ['variable', 27, 28], ['value range', 30, 33]]} (the start and end indices count characters of the original Chinese text).
The first position is obtained by extracting the key information from all the text in the image, and the key information carries different labels, such as mathematical terms, characters, and so on. In the example above, "mathematical term" is the label, "function analytic expression" is the key information, and "1, 5" are its start position and end position, respectively.
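In other words, the entity-pointer output can be viewed as a mapping from labels to (key information, start index, end index) triples over the image's concatenated text; a hypothetical rendering of the example above:

```python
# Hypothetical data structure mirroring the example output above; the keys and
# entries follow the text of this description, not an actual API of the NER model.
ner_output = {
    "mathematical term": [
        ("function analytic expression", 1, 5),
        ("independent variable", 10, 12),
        ("squared term", 14, 16),
        ("constant", 18, 19),
        ("variable", 27, 28),
        ("value range", 30, 33),
    ],
}
```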
The second position (which may also be referred to as the relative position of the key information within the current text detection box) can be obtained from the first position by simple arithmetic. For example, suppose an image contains three lines of text: the first line, "the history of the development of relativity", occupies character positions 0-7; the second line, "relativity is known as ...", occupies positions 8-15; and the third line, "how the relativity formula is written", occupies positions 16-23 (eight characters per line in the original Chinese text). The indices 0-23 denote the absolute position of each character in the image. The absolute position of "relativity" in the first line is 1-3, and the absolute position of "relativity" in the second line is 12-14.
In actual processing, each line of text corresponds to one text detection box. If the first position of the key information "relativity" in the second line's text detection box is 12-14, then the second position of "relativity" within that text detection box is 4-6 (the first position of the key information minus the first position of the first character in the text detection box, i.e., 12 - 8 = 4).
Assume that the position of the second line's text detection box is: upper-left corner (30, 40), lower-right corner (110, 70). From this, the box attributes of that text detection box are obtained: width = 110 - 30 = 80; height = 70 - 40 = 30.
From the position of the text detection box, the box attributes of the text detection box, and the second position, the image position of the key information in the image is calculated as: upper-left corner (70, 40); lower-right corner (100, 70).
Specifically, the width occupied by each character is calculated from the box attributes (namely (110 - 30) / 8 = 10 per character); from the second position (4-6) and the abscissa of the text detection box's position (30, 40), the abscissa of the image position is calculated as 30 + 4 × 10 = 70, while the ordinate of the image position inherits the ordinate of the text detection box, i.e., 40. Thus the upper-left corner of the key information is (70, 40). In the same way, the lower-right corner of the key information is obtained as (100, 70) (abscissa 30 + (6 + 1) × 10 = 100; ordinate 70).
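Putting the worked example together, the absolute-to-relative-to-pixel mapping can be sketched as below; the uniform per-character width is the simplification used in the example above, and the function name is hypothetical.

```python
def key_info_pixel_box(box_top_left, box_bottom_right, box_start_index,
                       num_chars, key_start, key_end):
    """Map a key-information span (absolute character indices over the whole
    image) to pixel coordinates inside its text detection box, assuming every
    character in the box occupies the same width."""
    (x1, y1), (x2, y2) = box_top_left, box_bottom_right
    char_width = (x2 - x1) / num_chars          # e.g. (110 - 30) / 8 = 10
    rel_start = key_start - box_start_index     # second position, e.g. 12 - 8 = 4
    rel_end = key_end - box_start_index         # e.g. 14 - 8 = 6
    left = x1 + rel_start * char_width          # 30 + 4 * 10 = 70
    right = x1 + (rel_end + 1) * char_width     # 30 + 7 * 10 = 100
    return (left, y1), (right, y2)

# key_info_pixel_box((30, 40), (110, 70), 8, 8, 12, 14) -> ((70.0, 40), (100.0, 70))
```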
Step 150, highlighting the image position in the third image.
Specifically, after obtaining the image position of the key information in the third image according to the description of step 140, the terminal highlights the image position in the third image according to the image position.
Optionally, the terminal highlights the image position in the third image, which specifically includes:
the terminal highlights the image position, or the terminal circles the image position for display.
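A sketch of these two display options with OpenCV follows; the colors and the transparency value are illustrative choices, not specified by the patent.

```python
import cv2

def show_key_info(image, top_left, bottom_right, circle=False):
    """Either overlay a semi-transparent highlight on the key-information
    region or draw an outline ("circle it out") around it."""
    out = image.copy()
    if circle:
        # Outline the region with a red rectangle.
        cv2.rectangle(out, top_left, bottom_right, color=(0, 0, 255), thickness=2)
    else:
        # Semi-transparent yellow overlay over the region.
        overlay = out.copy()
        cv2.rectangle(overlay, top_left, bottom_right, color=(0, 255, 255), thickness=-1)
        out = cv2.addWeighted(overlay, 0.4, out, 0.6, 0)
    return out
```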
Optionally, in the embodiment of the present application, after the terminal acquires the first image and identifies, through the trained formula detection algorithm, whether the first image includes formula content, there is another implementation manner.
In another implementation manner, if the formula content is not included in the first image, the terminal directly inputs the first image into the OCR model to obtain text content included in the first image; the terminal inputs the text content into the NER model to obtain key information included in the text content and an image position of the key information in the first image; the terminal highlights the image position in the first image.
The process of directly inputting the first image into the OCR model by the terminal to obtain the text content included in the first image, inputting the text content into the NER model by the terminal to obtain the key information included in the text content and the image position of the key information in the first image, and highlighting the image position in the first image by the terminal is the same as the processes in steps 130, 140 and 150, and will not be repeated here.
Optionally, in the embodiment of the present application, after the terminal highlights the image position in the third image, or after the terminal highlights the image position in the first image, the method further includes a text error correction process and a parsing process.
Specifically, the terminal performs error correction on the key information using a text error correction algorithm; the terminal then inputs the corrected key information into a natural language processing (NLP) model to obtain an analysis of the key information. When receiving an instruction generated by the user clicking the key information, the terminal displays the analysis of the key information around the image position, or displays the analysis of the key information in a floating layer above the first image or the third image.
The NLP model may specifically be an algorithmic model, such as a large language model (e.g., GPT), that analyzes and processes large amounts of text data and generates accurate answers and interpretations.
Further, in the embodiment of the application, error correction can be performed on the key information using an approach that learns the probability distribution of training data through unsupervised learning, and an approach based on edit distance. Both approaches are briefly described below.
Learning the probability distribution of training data through unsupervised learning:
1) Data preparation: a large amount of text data is collected, including correct text and possibly erroneous text. The text data may come from a variety of sources, such as web pages, news articles, social media, and so forth.
2) Error injection: some errors are manually or automatically injected into the prepared text data to simulate a real error condition. Errors may be introduced by randomly replacing characters, deleting characters, inserting characters, etc.
3) Feature extraction: feature extraction is performed on the prepared text data to represent the features of each text. Common features include character-level n-gram features, word-level n-gram features, TF-IDF features, and the like.
4) Unsupervised learning: an unsupervised learning method, such as a clustering algorithm, a dimensionality reduction algorithm, or a generative model, is trained on the feature-represented text data to learn its probability distribution.
5) Probability estimation: the trained unsupervised model performs probability estimation on the feature-represented text data; according to the learned probability distribution, the plausibility of the text is evaluated and error correction is performed. (A minimal sketch of this idea follows the list.)
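A minimal sketch of the probability-distribution idea, using character-bigram counts learned from unlabeled text; a real system would use richer features and models, so the function names and the scoring scheme here are assumptions for illustration only.

```python
from collections import Counter

def train_char_bigrams(corpus_lines):
    """Learn character-bigram counts from raw, unlabeled text."""
    counts = Counter()
    for line in corpus_lines:
        for a, b in zip(line, line[1:]):
            counts[(a, b)] += 1
    return counts

def plausibility(text, bigram_counts):
    """Score a candidate string by the average frequency of its character
    bigrams; low scores suggest unlikely text that may need correction."""
    if len(text) < 2:
        return 0.0
    total = sum(bigram_counts[(a, b)] for a, b in zip(text, text[1:]))
    return total / (len(text) - 1)
```

During correction, candidates with higher plausibility scores are preferred.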
Edit distance approach:
Edit distance (also known as Levenshtein distance) is a measure of the similarity between two strings. In text correction, the edit distance can be used to measure the difference between the input text and the correct text, in order to find possible error locations or erroneous operations.
The edit distance is defined as the minimum number of editing operations required to convert one string into another, where the allowed operations are insertion, deletion, and substitution.
For two strings s1 and s2, s1 can be converted into s2 by a series of editing operations, each of which inserts a character, deletes a character, or replaces a character. The edit distance is obtained by computing the minimum number of such operations.
For example, consider the strings s1 = "kitten" and s2 = "sitting". s1 can be converted into s2 by the following editing operations: replace "k" with "s"; replace "e" with "i"; insert "g" after "n". Three editing operations are required, so the edit distance is 3.
In text correction, the edit distance is used to compare the similarity between the input text and each candidate correction: the edit distance between the input text and every candidate is calculated, and the candidate with the smallest edit distance is selected as the corrected text (a sketch of the computation follows).
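A standard dynamic-programming implementation of the edit distance described above (two-row variant):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions needed to
    turn s1 into s2 (Levenshtein distance)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# edit_distance("kitten", "sitting") == 3, matching the example above.
```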
For example, suppose one character of the recognized key information was mis-recognized as a similar-looking but incorrect character. The probability-distribution approach and the edit-distance approach described above determine the candidate word with the highest similarity to the recognized key information, and the erroneous character is corrected accordingly.
In the embodiment of the application, all the key information in the third image or the first image can be identified and stored in one set, and then text error correction operation is uniformly performed on the key information in the set.
Therefore, by applying the information processing method provided by the application, the terminal acquires the first image included in the teaching software; if the first image comprises formula content, the terminal fills the formula content in the first image into a second image to obtain a third image; the terminal inputs the third image into the OCR model to obtain text content included in the third image; the terminal inputs the text content into the NER model to obtain key information included in the text content and an image position of the key information in a third image; the terminal highlights the image position in the third image.
In this way, when the teaching software contains formula content in a digital education scenario, covering the formula content and extracting the key information and its position from the image in which the formula content has been covered improves the extensibility and readability of teaching courseware; meanwhile, highlighting the key information helps students grasp the key content of the class and improves the interactive experience of digital teaching. The method solves several problems in the existing keyword positioning and display process. During teaching, the terminal is also compatible with different teaching software, solving the problems that different teaching software provides different interfaces for obtaining teaching information, offers incomplete coverage, and is complex to invoke.
Based on the same inventive concept, an embodiment of the application also provides an information processing apparatus corresponding to the information processing method. Referring to FIG. 2, FIG. 2 shows an information processing apparatus provided in an embodiment of the present application. The apparatus is applied to a terminal and includes:
an acquisition unit 210 for acquiring a first image included in the teaching software;
a filling unit 220, configured to, if the first image includes formula content, fill the formula content in the first image into a second image, and obtain a third image;
a first processing unit 230, configured to input the third image into an OCR model, to obtain text content included in the third image;
a second processing unit 240, configured to input the text content into a NER model, to obtain key information included in the text content and an image position of the key information in the third image;
and a display unit 250, configured to highlight an image position in the third image.
Optionally, the first processing unit 230 is further configured to, if the formula content is not included in the first image, input the first image into an OCR model to obtain text content included in the first image;
The second processing unit 240 is further configured to input the text content into a NER model, so as to obtain key information included in the text content and an image position of the key information in the first image;
the display unit 250 is further configured to highlight an image position in the first image.
Optionally, the first processing unit 230 is specifically configured to,
performing binarization operation on the image, inputting the image subjected to the binarization operation into a DBNet algorithm, and outputting the positions of a plurality of text detection boxes and the first confidence coefficient of each text detection box;
selecting a first number of confidences from the plurality of first confidences;
inputting the first text detection boxes corresponding to the first number of confidence degrees and the positions of the first text detection boxes into a CRNN algorithm, outputting coded text identifiers corresponding to each first text detection box and second confidence degrees of the coded text identifiers, wherein the coded text identifiers are used for representing text contents in the first text detection boxes;
selecting a second number of confidences from the plurality of second confidences;
and decoding the text identifiers with the second number of confidence coefficients by using the vocabulary file to obtain text contents included in the image.
Optionally, the second processing unit 240 is specifically configured to,
inputting the text content into a NER model to obtain key information included in the text content and a first position of the key information in the image;
calculating a second position of the key information in the first text detection box according to a first position of the first text content in the first text detection box and a first position of the key information in the image;
and calculating the image position by using the position of the first text detection box, the box attribute of the first text detection box and the second position.
Optionally, the display unit 250 is specifically configured to highlight the image location, or to circle the image location for display.
Optionally, the apparatus further comprises:
an error correction unit (not shown in the figure) for performing error correction processing on the key information by using a text error correction algorithm;
a third processing unit (not shown in the figure) for inputting the key information after error correction processing into an NLP model to obtain analysis of the key information;
the display unit 250 is further configured to display, when an instruction generated by the user clicking the key information is received, the analysis of the key information around the image position, or to display the analysis of the key information in a floating layer above the first image or the third image.
Therefore, by applying the information processing device provided by the application, the terminal acquires the first image included in the teaching software; if the first image comprises formula content, the terminal fills the formula content in the first image into a second image to obtain a third image; the terminal inputs the third image into the OCR model to obtain text content included in the third image; the terminal inputs the text content into the NER model to obtain key information included in the text content and an image position of the key information in a third image; the terminal highlights the image position in the third image.
In this way, when the teaching software contains formula content in a digital education scenario, covering the formula content and extracting the key information and its position from the image in which the formula content has been covered improves the extensibility and readability of teaching courseware; meanwhile, highlighting the key information helps students grasp the key content of the class and improves the interactive experience of digital teaching. The apparatus solves several problems in the existing keyword positioning and display process. During teaching, the terminal is also compatible with different teaching software, solving the problems that different teaching software provides different interfaces for obtaining teaching information, offers incomplete coverage, and is complex to invoke.
Based on the same inventive concept, the embodiment of the present application also provides a network device, as shown in FIG. 3, including a processor 310, a transceiver 320, and a machine-readable storage medium 330, where the machine-readable storage medium 330 stores machine-executable instructions capable of being executed by the processor 310, and the machine-executable instructions cause the processor 310 to perform the information processing method provided by the embodiment of the present application. The information processing apparatus shown in FIG. 2 may be implemented using the hardware structure of a network device as shown in FIG. 3.
The machine-readable storage medium 330 may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk memory. Optionally, the machine-readable storage medium 330 may also be at least one storage device located remotely from the aforementioned processor 310.
The processor 310 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In an embodiment of the present application, the processor 310 reads the machine-executable instructions stored in the machine-readable storage medium 330, and the machine-executable instructions cause the processor 310 itself, together with the invoked transceiver 320, to perform the information processing method described in the foregoing embodiments of the present application.
In addition, an embodiment of the present application provides a machine-readable storage medium 330, which stores machine-executable instructions that, when invoked and executed by the processor 310, cause the processor 310 itself, together with the invoked transceiver 320, to perform the information processing method described in the foregoing embodiments of the present application.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement them without undue effort.
For the information processing apparatus and the machine-readable storage medium embodiments, since the method contents involved are substantially similar to those of the foregoing method embodiments, the description is relatively simple, and reference should be made to the description of the method embodiments for relevant points.
The foregoing description is only of preferred embodiments of the present application and is not intended to limit the present application; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (12)

1. An information processing method, wherein the method is applied to a terminal, the method comprising:
acquiring a first image included in teaching software;
if the first image comprises formula content, filling the formula content in the first image into a second image to obtain a third image;
inputting the third image into an OCR model to obtain text contents included in the third image;
inputting the text content into a NER model to obtain key information included in the text content and an image position of the key information in the third image;
and highlighting the image position in the third image.
2. The method according to claim 1, wherein the method further comprises:
if the first image does not contain formula content, inputting the first image into an OCR model to obtain text content contained in the first image;
inputting the text content into a NER model to obtain key information included in the text content and an image position of the key information in the first image;
and highlighting the image position in the first image.
3. The method according to any one of claims 1 or 2, wherein the inputting the third image into the OCR model to obtain the text content included in the third image, or the inputting the first image into the OCR model to obtain the text content included in the first image specifically includes:
performing binarization operation on the image, inputting the image subjected to the binarization operation into a DBNet algorithm, and outputting the positions of a plurality of text detection boxes and the first confidence coefficient of each text detection box;
selecting a first number of confidences from the plurality of first confidences;
inputting the first text detection boxes corresponding to the first number of confidence degrees and the positions of the first text detection boxes into a CRNN algorithm, outputting coded text identifiers corresponding to each first text detection box and second confidence degrees of the coded text identifiers, wherein the coded text identifiers are used for representing text contents in the first text detection boxes;
Selecting a second number of confidences from the plurality of second confidences;
and decoding the text identifiers with the second number of confidence coefficients by using the vocabulary file to obtain text contents included in the image.
4. The method according to claim 3, wherein the inputting the text content into the NER model to obtain the key information included in the text content and the image position of the key information in the third image, or the inputting the text content into the NER model to obtain the key information included in the text content and the image position of the key information in the first image, specifically includes:
inputting the text content into a NER model to obtain key information included in the text content and a first position of the key information in the image;
calculating a second position of the key information in the first text detection box according to a first position of the first text content in the first text detection box and a first position of the key information in the image;
and calculating the image position by using the position of the first text detection box, the box attribute of the first text detection box and the second position.
5. A method according to claim 3, wherein the highlighting the image position in the third image or the highlighting the image position in the first image specifically comprises:
highlighting the image location or circling the image location for display.
6. A method according to claim 3, characterized in that the method further comprises:
performing error correction processing on the key information by using a text error correction algorithm;
inputting the key information subjected to error correction processing into an NLP model to obtain analysis of the key information;
when an instruction of clicking the key information input by a user is received, the analysis of the key information is displayed around the image position, or the analysis of the key information is displayed in suspension on the upper layer of the first image or the third image.
7. An information processing apparatus, the apparatus being applied to a terminal, the apparatus comprising:
an acquisition unit configured to acquire a first image included in teaching software;
the filling unit is used for filling the formula content in the first image into a second image to obtain a third image if the formula content is included in the first image;
The first processing unit is used for inputting the third image into an OCR model to obtain text contents included in the third image;
the second processing unit is used for inputting the text content into the NER model to obtain key information included in the text content and an image position of the key information in the third image;
and the display unit is used for highlighting the image position in the third image.
8. The apparatus of claim 7, wherein the first processing unit is further configured to input the first image into an OCR model to obtain text content included in the first image if formula content is not included in the first image;
the second processing unit is further configured to input the text content into a NER model, to obtain key information included in the text content and an image position of the key information in the first image;
the display unit is further configured to highlight an image position in the first image.
9. The device according to any one of claims 7 or 8, wherein the first processing unit is specifically configured to,
performing binarization operation on the image, inputting the image subjected to the binarization operation into a DBNet algorithm, and outputting the positions of a plurality of text detection boxes and the first confidence coefficient of each text detection box;
Selecting a first number of confidences from the plurality of first confidences;
inputting the first text detection boxes corresponding to the first number of confidence degrees and the positions of the first text detection boxes into a CRNN algorithm, outputting coded text identifiers corresponding to each first text detection box and second confidence degrees of the coded text identifiers, wherein the coded text identifiers are used for representing text contents in the first text detection boxes;
selecting a second number of confidences from the plurality of second confidences;
and decoding the text identifiers with the second number of confidence coefficients by using the vocabulary file to obtain text contents included in the image.
10. The device according to claim 9, wherein the second processing unit is specifically configured to,
inputting the text content into a NER model to obtain key information included in the text content and a first position of the key information in the image;
calculating a second position of the key information in the first text detection box according to a first position of the first text content in the first text detection box and a first position of the key information in the image;
And calculating the image position by using the position of the first text detection box, the box attribute of the first text detection box and the second position.
11. The device according to claim 9, wherein the display unit is in particular adapted to,
highlighting the image location or circling the image location for display.
12. The apparatus of claim 9, wherein the apparatus further comprises:
the error correction unit is used for carrying out error correction processing on the key information by utilizing a text error correction algorithm;
the third processing unit is used for inputting the key information subjected to error correction processing into an NLP model to obtain analysis of the key information;
the display unit is further configured to display, when an instruction of clicking the key information input by a user is received, analysis of the key information around the image position, or to suspend and display the analysis of the key information on an upper layer of the first image or the third image.
CN202310614640.0A 2023-05-27 2023-05-27 Information processing method and device Pending CN116704508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310614640.0A CN116704508A (en) 2023-05-27 2023-05-27 Information processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310614640.0A CN116704508A (en) 2023-05-27 2023-05-27 Information processing method and device

Publications (1)

Publication Number Publication Date
CN116704508A true CN116704508A (en) 2023-09-05

Family

ID=87842557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310614640.0A Pending CN116704508A (en) 2023-05-27 2023-05-27 Information processing method and device

Country Status (1)

Country Link
CN (1) CN116704508A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015644A (en) * 2024-04-10 2024-05-10 一网互通(北京)科技有限公司 Social media keyword data analysis method and device based on pictures and characters
CN118015644B (en) * 2024-04-10 2024-06-07 一网互通(北京)科技有限公司 Social media keyword data analysis method and device based on pictures and characters

Similar Documents

Publication Publication Date Title
CN110750959B (en) Text information processing method, model training method and related device
CN112528963A (en) Intelligent arithmetic question reading system based on MixNet-YOLOv3 and convolutional recurrent neural network CRNN
CN110363194A (en) Intelligently reading method, apparatus, equipment and storage medium based on NLP
CN111507330B (en) Problem recognition method and device, electronic equipment and storage medium
CN104463101A (en) Answer recognition method and system for textual test question
CN111523622B (en) Method for simulating handwriting by mechanical arm based on characteristic image self-learning
CN114596566B (en) Text recognition method and related device
CN116543404A (en) Table semantic information extraction method, system, equipment and medium based on cell coordinate optimization
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN114255159A (en) Handwritten text image generation method and device, electronic equipment and storage medium
CN113762269A (en) Chinese character OCR recognition method, system, medium and application based on neural network
CN113407675A (en) Automatic education subject correcting method and device and electronic equipment
CN114663904A (en) PDF document layout detection method, device, equipment and medium
CN111932418B (en) Student learning condition identification method and system, teaching terminal and storage medium
CN113177435A (en) Test paper analysis method and device, storage medium and electronic equipment
CN116704508A (en) Information processing method and device
CN115661836A (en) Automatic correction method, device and system and readable storage medium
US20230084845A1 (en) Entry detection and recognition for custom forms
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
CN115346223A (en) Method and device for evaluating written information, electronic equipment and storage medium
CN110929013A (en) Image question-answer implementation method based on bottom-up entry and positioning information fusion
CN115620314A (en) Text recognition method, answer text verification method, device, equipment and medium
CN116030469A (en) Processing method, processing device, processing equipment and computer readable storage medium
CN115618043A (en) Text operation graph mutual inspection method and model training method, device, equipment and medium
CN115617959A (en) Question answering method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination