CN116189271B - Data processing method and system based on smart watch lip language recognition - Google Patents

Data processing method and system based on smart watch lip language recognition

Info

Publication number
CN116189271B
CN116189271B, CN202310425186.4A, CN202310425186A
Authority
CN
China
Prior art keywords
data
lip
preset
language
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310425186.4A
Other languages
Chinese (zh)
Other versions
CN116189271A (en)
Inventor
单文豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Manridy Technology Co ltd
Original Assignee
Shenzhen Manridy Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Manridy Technology Co ltd filed Critical Shenzhen Manridy Technology Co ltd
Priority to CN202310425186.4A
Publication of CN116189271A
Application granted
Publication of CN116189271B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a data processing method and system based on smart watch lip language recognition, applied to the field of data processing. The method comprises: collecting facial image data of a user based on a preset scanner and generating a recognition area to be captured according to the facial image data; obtaining the content to be recognized of the recognition area and analyzing it to obtain a transformation track of the recognition area; performing data synthesis on the transformation track based on its dynamic transformation to generate lip language data of the user; obtaining voice data of the user from the lip language data by means of a prediction model; extracting the voice data by applying a preset database and capturing language features appearing in the voice data; analyzing the lip language data based on the voice data and the language features to generate lip language processing data of the user; performing text translation on the lip language processing data based on a preset mapping relation to generate at least one or more language texts corresponding to the lip language processing data; and presenting the language texts on a preset display screen.

Description

Data processing method and system based on smart watch lip language recognition
Technical Field
The invention relates to the field of data processing, in particular to a data processing method and system for recognizing lip language based on an intelligent watch.
Background
With the continuous development of social productivity and of science and technology, the demand of various industries for language recognition technology keeps growing; such recognition involves three-dimensional dynamic scenes and entity behaviors built on multi-source information fusion and interaction. Smart wearable devices are used to record a user's spoken data, but they cannot perform lip language recognition, so the voice recognition function is difficult to realize when the user is in a noisy environment, and communication barriers easily occur when the user interacts with the smart wearable device.
Disclosure of Invention
The invention aims to solve the problem that a user can hardly use the voice recognition function in a noisy environment and that communication barriers easily occur when the user interacts with a smart wearable device, and provides a data processing method and a data processing system based on smart watch lip language recognition.
The invention adopts the following technical means for solving the technical problems:
the invention provides a data processing method based on intelligent watch lip language identification, which comprises the following steps:
acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, and dynamically recognizing the recognition area by using the scanner to acquire the content to be recognized of the recognition area;
Inputting part information in the identification area into a training model for training to obtain a trained prediction model, analyzing the content to be identified to obtain a transformation track of the identification area, carrying out data synthesis on the transformation track based on dynamic transformation to generate lip language data of the user, inputting the lip language data into the prediction model for prediction to obtain voice data of the user, wherein the transformation track is specifically a transformation track of mouth action when the user speaks, and the part information comprises an upper lip, a lower lip, a lip angle and a lip valley;
extracting the voice data by using a preset database, and capturing language features appearing in the voice data, wherein the language features comprise sound production, languages and language branch dialects;
analyzing the lip language data based on the voice data and the language features, generating lip language processing data of the user, and judging whether the lip language processing data is clear or not according to a preset statement scoring table;
if yes, text translation is carried out on the lip language processing data, the lip language processing data is translated based on a preset mapping relation, at least one or more language texts corresponding to the lip language processing data are generated, and the language texts are presented on a preset display screen, wherein the language texts are specifically language type texts arranged based on preset priorities.
Further, the step of acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, dynamically recognizing the recognition area by using the scanner, and acquiring the content to be recognized of the recognition area includes:
capturing the current scannable width and the scanning distance of the user;
judging whether the scannable width and the scanning distance are within a preset scanning range or not;
if yes, acquiring face information in the scanning range, comparing the face information with a preset scanning template in a difference way, generating a recognizable area of the face information, acquiring identity data corresponding to the face information according to the recognizable area, and acquiring lip fluctuation information in the recognizable area; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
if the lip fluctuation information exists in the face information, recording the starting time and the ending time of the lip fluctuation information.
Further, the step of inputting the location information in the identification area into a training model to train and obtaining a trained prediction model includes:
Acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as cluster numbers, and clustering the image data by using the clustering algorithm to generate a clustering result, wherein the clustering result comprises the cluster label assigned to each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means clustering and spectral clustering;
determining image data whose assigned cluster label in the clustering result is inconsistent with its corresponding annotation label as an abnormal training sample;
deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
Further, the step of analyzing the content to be identified to obtain a transformation track of the identification area and synthesizing data based on dynamic transformation of the transformation track includes:
based on the dynamic scene recorded in the content to be identified, constructing each reference axis taking the identification area as a reference coordinate system, and taking the position information as each coordinate of an independent point;
And extracting each reference axis and each coordinate to serve as an image to be converted, mapping the image to be converted into a preset space template according to a preset conversion proportion, generating a converted image corresponding to the image to be converted, and acquiring a dynamic conversion track of the converted image by applying a preset frame rate based on view angle information preset by the space template and the dynamic scene.
Further, the step of extracting the voice data by using a preset database and capturing language features appearing in the voice data includes:
extracting period data of at least one or more preset time points in the voice data;
and encoding the period data by using a preset encoder, converting the period data into voice feature vectors corresponding to each preset time point, and generating voice features corresponding to the voice data based on the voice feature vectors, wherein the encoder characterizes the voice data as vectors by applying an embedding layer to each item in the period data set.
Further, the step of analyzing the lip language data based on the voice data and the language features to generate lip language processing data of the user and judging whether the lip language processing data is clear according to a preset sentence scoring table includes:
Converting the lip language processing data into text information in a preset format based on the language type;
inputting the text information into a preset reading model for reading, and generating sentence scores corresponding to the text information based on the text meanings of the text information;
judging whether the sentence score is larger than a score benchmark preset in the sentence score table;
if yes, judging that the lip language processing data has corresponding meaning.
Further, before the step of acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, dynamically recognizing the recognition area by using the scanner, and acquiring the content to be recognized of the recognition area, the method comprises the following steps:
capturing decibel data in an environment;
judging whether the decibel data is larger than a preset reading threshold value or not;
if yes, stopping capturing the decibel data in the environment, starting a preset scanner, and acquiring the image data in the preset range based on the temperature information in the preset range.
The invention also provides a data processing system based on the intelligent watch lip language identification, which comprises:
the acquisition module is used for acquiring facial image data of a user based on a preset scanner, generating an identification area to be captured according to the facial image data, and dynamically identifying the identification area by using the scanner to acquire the content to be identified of the identification area;
The prediction module is used for inputting the position information in the identification area into a training model for training to obtain a trained prediction model, analyzing the content to be identified to obtain a transformation track of the identification area, carrying out data synthesis on the transformation track based on dynamic transformation to generate lip language data of the user, inputting the lip language data into the prediction model for prediction to obtain voice data of the user, wherein the transformation track is specifically a transformation track of mouth action when the user speaks, and the position information comprises an upper lip, a lower lip, lip angles and lip valleys;
the capturing module is used for extracting the voice data by applying a preset database and capturing language features appearing in the voice data, wherein the language features comprise sound production, languages and language branch dialects;
the judging module is used for analyzing the lip language data based on the voice data and the language characteristics, generating lip language processing data of the user, and judging whether the lip language processing data is clear or not according to a preset statement scoring table;
and the execution module is used for, if the lip language processing data is clear, performing text translation on the lip language processing data, translating the lip language processing data based on a preset mapping relation, generating at least one or more language texts corresponding to the lip language processing data, and presenting the language texts on a preset display screen, wherein the language texts are specifically language type texts arranged based on a preset priority.
Further, the acquisition module further comprises:
the capturing unit is used for capturing the current scannable width and the scanning distance of the user;
the judging unit is used for judging whether the scannable width and the scanning distance are in a preset scanning range or not;
the execution unit is used for, if the scannable width and the scanning distance are within the preset scanning range, acquiring the face information in the scanning range, comparing the face information with a preset scanning template in a difference way, generating an identifiable region of the face information, acquiring identity data corresponding to the face information according to the identifiable region, and acquiring lip fluctuation information in the identifiable region; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
and the recording unit is used for recording the starting time and the ending time of the lip fluctuation information if the lip fluctuation information exists in the face information.
Further, the prediction module further includes:
the acquisition unit is used for acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
The generating unit is used for determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as cluster numbers, and clustering the image data by utilizing the clustering algorithm to generate a clustering result, wherein the clustering result comprises the cluster label assigned to each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means clustering and spectral clustering;
the comparison unit is used for determining image data whose assigned cluster label in the clustering result is inconsistent with its corresponding annotation label as an abnormal training sample;
the training unit is used for deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
The invention provides a data processing method and a system based on intelligent watch lip language identification, which have the following beneficial effects:
according to the invention, after the lip image data of a user is collected, it is analyzed to obtain a transformation track; the transformation track is subjected to data synthesis to generate lip language data of the user; the lip language data is input into a prediction model to predict the voice data of the user; the language features of the voice data are extracted; the lip language processing data of the user is generated and translated to obtain language text information, which is then presented on the display screen preset in the smart watch. In this way the user can still use the voice recognition function in a noisy environment, and the probability of communication barriers when the user interacts with the smart wearable device is reduced.
Drawings
FIG. 1 is a flow chart of an embodiment of a method for processing data based on a smart watch identification lip language according to the present invention;
fig. 2 is a block diagram illustrating an embodiment of a data processing system based on a smart watch identification lip language according to the present invention.
Detailed Description
The achievement of the objects, the functional features and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present invention.
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a data processing method based on a smart watch identification lip language according to an embodiment of the present invention includes:
s1: acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, and dynamically recognizing the recognition area by using the scanner to acquire the content to be recognized of the recognition area;
S2: inputting part information in the identification area into a training model for training to obtain a trained prediction model, analyzing the content to be identified to obtain a transformation track of the identification area, carrying out data synthesis on the transformation track based on dynamic transformation to generate lip language data of the user, inputting the lip language data into the prediction model for prediction to obtain voice data of the user, wherein the transformation track is specifically a transformation track of mouth action when the user speaks, and the part information comprises an upper lip, a lower lip, a lip angle and a lip valley;
s3: extracting the voice data by using a preset database, and capturing language features appearing in the voice data, wherein the language features comprise sound production, languages and language branch dialects;
s4: analyzing the lip language data based on the voice data and the language features, generating lip language processing data of the user, and judging whether the lip language processing data is clear or not according to a preset statement scoring table;
s5: if yes, text translation is carried out on the lip language processing data, the lip language processing data is translated based on a preset mapping relation, at least one or more language texts corresponding to the lip language processing data are generated, and the language texts are presented on a preset display screen, wherein the language texts are specifically language type texts arranged based on preset priorities.
In this embodiment, the intelligent system collects facial image data of the user wearing the smart watch based on a preset scanner, generates the recognition area to be captured, namely the mouth area, according to the facial image data, and dynamically recognizes the mouth area by using the scanner to obtain the voice content to be recognized of the mouth area. The scenario can be summarized as follows: because the user is in a noisy environment, the smart watch cannot record the user's voice content by capturing volume, so the user's mouth area is scanned to obtain the voice content to be recognized. The intelligent system then takes the corresponding position information in the recognition area (including the upper lip position, the lower lip position, the lip angle position and the lip valley position of the mouth area) as training samples, inputs the training samples into a blank training model for training to obtain a trained prediction model, analyzes the voice content to be recognized to obtain the lip transformation track of the user corresponding to that content, performs data synthesis on the transformation track based on the dynamic transformation process to generate the lip language data corresponding to the voice content to be recognized, and applies the prediction model to predict the lip language data to obtain the voice data corresponding to the user. The intelligent system applies a preset large database to extract features of the user's voice data and captures the language features appearing in the voice data, including the sound production features, the language types and the branch dialects corresponding to the language types. The intelligent system parses the user's lip language data based on the voice data and the corresponding language features to generate the lip language processing data of the user, and then judges whether the lip language processing data is clear according to a preset sentence scoring table so as to execute the corresponding step. For example, if the system judges from the sentence scoring table that the lip language processing data is not clear, that is, the meaning of the user's voice content to be recognized is ambiguous, the smart watch displays the preset prompt "please repeat the voice content to be read" on the watch display screen to remind the user to repeat the content whose meaning was unclear, and the system performs reading recognition again. If the system judges from the sentence scoring table that the lip language processing data is clear, that is, the voice content to be recognized has a corresponding meaning, the smart watch translates that meaning, translates the lip language processing data based on a preset mapping relation, generates one or more language texts corresponding to the translated text (for example, for a Chinese translation, language text branches such as Mandarin, Cantonese (Baihua) and Hakka; for a Spanish translation, branches such as Catalan, Basque and Galician), and finally presents these language texts on the display screen of the smart watch.
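As an illustrative sketch only (not part of the claimed method), the priority-based arrangement of the generated language texts could be implemented as follows; Python is assumed, and the function name ordered_language_texts, the sample translations and the priority list are placeholders invented for illustration:

```python
def ordered_language_texts(translations: dict[str, str], priority: list[str]) -> list[str]:
    """Arrange the generated language texts by the preset priority before display."""
    return [f"{lang}: {translations[lang]}" for lang in priority if lang in translations]

# Example usage: the texts are then written to the smart watch display in priority order.
texts = ordered_language_texts(
    {"Mandarin": "...", "Cantonese": "...", "Catalan": "..."},
    priority=["Mandarin", "Cantonese", "Catalan"],
)
```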
In this embodiment, acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, dynamically recognizing the recognition area by using the scanner, and acquiring the content to be recognized of the recognition area, where step S1 includes:
s11: capturing the current scannable width and the scanning distance of the user;
s12: judging whether the scannable width and the scanning distance are within a preset scanning range or not;
s13: if yes, acquiring face information in the scanning range, comparing the face information with a preset scanning template in a difference way, generating a recognizable area of the face information, acquiring identity data corresponding to the face information according to the recognizable area, and acquiring lip fluctuation information in the recognizable area; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
s14: if the lip fluctuation information exists in the face information, recording the starting time and the ending time of the lip fluctuation information.
In this embodiment, the intelligent system captures the width of the user and the scanning distance in the scanner frame against the set maximum scannable width and maximum scanning distance of the scanner, and judges whether they fall within the preset maximum width and maximum scanning distance so as to execute the corresponding step. For example, if the width of the user and the scanning distance in the scanner frame are not within the preset maximum width and maximum scanning distance, the smart watch does not meet the scanning condition, the user cannot be confirmed to be in the scannable range, and the scanning function is not started to scan the user. If the width of the user and the scanning distance are within the preset maximum width and maximum scanning distance, the smart watch collects the face information in the scanning range, performs a difference comparison between the face information and the preset scanning template to generate the identifiable area of the face information (namely the data of the five sense organs), obtains the identity data corresponding to the user according to the face information data of the identifiable area, collects the lip fluctuation information of the user, and records the start time and the end time of the lip fluctuation information.
It should be noted that the reason for the difference comparison between the face information and the scanning template is as follows: if the user suddenly leaves the range of the scanner during scanning, the comparison prevents the scanner from erroneously treating other information with a similar temperature as the face information.
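A minimal sketch of this scan-range gate and of recording the start and end time of the lip fluctuation is given below; Python is assumed, the width and distance limits are illustrative values rather than values taken from the patent, and all names are placeholders:

```python
def within_scan_range(width: float, distance: float,
                      max_width: float = 0.4, max_distance: float = 0.6) -> bool:
    """Only start face/lip acquisition when the user fits the scanner's preset range."""
    return width <= max_width and distance <= max_distance

def fluctuation_window(samples: list[tuple[float, bool]]):
    """samples: (timestamp, lips_moving) pairs; returns (start_time, end_time) or None."""
    moving = [t for t, is_moving in samples if is_moving]
    return (moving[0], moving[-1]) if moving else None

if within_scan_range(width=0.3, distance=0.5):
    window = fluctuation_window([(0.0, False), (0.1, True), (0.2, True), (0.3, False)])
    # window == (0.1, 0.2): start and end time of the lip fluctuation information
```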
In this embodiment, the step S2 of inputting the location information in the identification area into the training model to perform training, and obtaining the trained prediction model includes:
s21: acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
s22: determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as cluster numbers, and clustering the image data by using the clustering algorithm to generate a clustering result, wherein the clustering result comprises the cluster label assigned to each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means clustering and spectral clustering;
s23: determining image data whose assigned cluster label in the clustering result is inconsistent with its corresponding annotation label as an abnormal training sample;
S24: deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
In this embodiment, the system acquires the initial training sample set used to train the blank training model (including the image data of the part information of the mouth area and the annotation content corresponding to the image data), determines the clustering algorithm to be adopted for cleaning the initial training sample set (a K-means clustering algorithm or a spectral clustering algorithm), takes the corresponding annotation content as the cluster numbers, and applies the clustering algorithm to cluster the image data to generate a clustering result, which covers the image data of the lip angle, lip bow, lip peak, lip bead, lip valley, upper lip and lower lip of the part information. The system then determines, based on the annotation information and the labels in the clustering result, the image data that is inconsistent with its label as abnormal training samples (that is, redundant image data other than the lip angle, lip valley, upper lip and lower lip), deletes these abnormal training samples from the initial training sample set to obtain a cleaned training sample set, and inputs the cleaned training sample set into the blank training model for training.
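As a sketch under stated assumptions (Python with numpy and scikit-learn, integer-encoded labels; the name clean_training_set is a placeholder, not the patent's own implementation), the label-consistency cleaning described above could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def clean_training_set(features: np.ndarray, labels: np.ndarray):
    """Cluster lip-region image features and drop samples whose cluster
    disagrees with their annotated label (the 'abnormal training samples')."""
    n_clusters = len(np.unique(labels))  # one cluster per annotation label
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

    # Map each cluster to the label that dominates it, then keep only
    # samples whose own label matches that majority label.
    keep = np.zeros(len(labels), dtype=bool)
    for c in range(n_clusters):
        members = clusters == c
        if not members.any():
            continue
        majority = np.bincount(labels[members]).argmax()
        keep |= members & (labels == majority)
    return features[keep], labels[keep]  # the cleaned training sample set
```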
In this embodiment, the step S2 of analyzing the content to be identified to obtain a transformation track of the identification area and synthesizing the transformation track based on dynamic transformation includes:
s201: based on the dynamic scene recorded in the content to be identified, constructing each reference axis taking the identification area as a reference coordinate system, and taking the position information as each coordinate of an independent point;
s202: and extracting each reference axis and each coordinate to serve as an image to be converted, mapping the image to be converted into a preset space template according to a preset conversion proportion, generating a converted image corresponding to the image to be converted, and acquiring a dynamic conversion track of the converted image by applying a preset frame rate based on view angle information preset by the space template and the dynamic scene.
In this embodiment, the intelligent system constructs three reference axes X, Y and Z with the area to be recognized (i.e., the mouth area) as the reference coordinate system, based on the dynamic scene data of the user recorded in the voice content to be recognized, and takes the four coordinates of the upper lip, the lower lip, the lip angle and the lip valley of the part information as independent points. A reference plane belonging to the mouth area is established from the reference axes and the coordinates; by extracting the three reference axes and the four coordinates in the reference plane, the reference plane is taken as the image to be converted and mapped into a preset space template according to a preset conversion proportion, generating a converted image that has the same content as the image to be converted but a different proportional size. The system then applies a preset frame rate to acquire the dynamic transformation track of the converted image based on the preset viewing angle information of the space template and the dynamic scene data, and finally obtains, through the dynamic transformation track, the data from the start of the user's lip transformation to the end of the transformation.
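A minimal numpy sketch of this mapping is shown below, assuming each captured frame supplies the four part-information landmarks (upper lip, lower lip, lip angle, lip valley) as (x, y, z) coordinates in the mouth-area reference system; the conversion proportion and frame rate are illustrative values, not values from the patent:

```python
import numpy as np

def transformation_track(frames: list[np.ndarray], proportion: float = 0.5, frame_rate: int = 30):
    """frames: one (4, 3) array of landmark coordinates per captured frame.
    Returns (timestamps, landmarks mapped into the space template)."""
    track = np.stack([proportion * f for f in frames])  # same content, different proportional size
    timestamps = np.arange(len(frames)) / frame_rate    # sampled at the preset frame rate
    return timestamps, track
```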
In this embodiment, the step S3 of extracting the voice data by using a preset database and capturing the language features appearing in the voice data includes:
s31: extracting period data of at least one or more preset time points in the voice data;
s32: and encoding the period data by using a preset encoder, converting the period data into voice feature vectors corresponding to each preset time point, and generating voice features corresponding to the voice data based on the voice feature vectors, wherein the encoder characterizes the voice data as vectors by applying an embedding layer to each item in the period data set.
In this embodiment, the system extracts the period data of one or more predetermined time points in the predicted voice data and uses the embedding layer of the encoder model to convert the voice data of each predetermined time point into a voice input vector, so as to obtain a sequence of voice input vectors that facilitates the subsequent encoding. The sequence of voice input vectors is then passed through the converter of the encoder model to convert the voice data of each predetermined time point into the voice feature vector corresponding to that time point. It should be understood that, because the converter-based encoder model encodes the voice input vectors based on context, the obtained voice feature vectors capture the associated voice information of the plurality of predetermined time points globally, so as to generate the voice features corresponding to the voice data.
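The following is a minimal sketch of such a context-based encoder, assuming PyTorch and assuming the "converter" is a standard Transformer encoder; the linear "embedding" projection, the dimensions and the layer counts are illustrative assumptions rather than the patent's actual model:

```python
import torch
import torch.nn as nn

class SpeechFeatureEncoder(nn.Module):
    def __init__(self, frame_dim: int = 40, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)  # "embedding layer" for continuous period data
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, period_data: torch.Tensor) -> torch.Tensor:
        # period_data: (batch, time_points, frame_dim) -> (batch, time_points, d_model)
        vectors = self.embed(period_data)   # sequence of voice input vectors
        return self.encoder(vectors)        # context-encoded voice feature vectors

# Example: 8 predetermined time points, each described by 40-dimensional period data.
features = SpeechFeatureEncoder()(torch.randn(1, 8, 40))
```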
In this embodiment, the step S4 of analyzing the lip language data based on the voice data and the language features to generate lip language processing data of the user and determining whether the lip language processing data is clear according to a preset sentence scoring table includes:
s41: converting the lip language processing data into text information in a preset format based on the language type;
s42: inputting the text information into a preset reading model for reading, and generating sentence scores corresponding to the text information based on the text meanings of the text information;
s43: judging whether the sentence score is larger than a score benchmark preset in the sentence score table;
s44: if yes, judging that the lip language processing data has corresponding meaning.
In this embodiment, the intelligent system converts the lip language processing data into text information in a preset format based on the language type, inputs the text information into a preset reading model for reading to obtain the text meaning corresponding to the text information, generates a sentence score corresponding to the text information based on that meaning, and judges whether the sentence score is greater than the score benchmark preset in the sentence scoring table so as to execute the corresponding step. For example, if the sentence score generated by the system is 70 points and the preset score benchmark in the scoring table is 60 points, the system determines that the lip language processing data has a corresponding text meaning and displays that meaning. If the sentence score is 55 points and the preset benchmark is 60 points, the system determines that the text meaning of the lip language processing data is ambiguous and does not display it.
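A one-function sketch of this clarity check is given below (Python assumed; 60, 70 and 55 points are the example values used in this embodiment, and the function name is a placeholder):

```python
def lip_data_is_clear(sentence_score: float, benchmark: float = 60.0) -> bool:
    """Return True when the read sentence score exceeds the preset score benchmark."""
    return sentence_score > benchmark

assert lip_data_is_clear(70)       # 70 > 60: meaning judged clear, text meaning is displayed
assert not lip_data_is_clear(55)   # 55 < 60: meaning judged ambiguous, nothing is displayed
```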
In this embodiment, acquiring facial image data of a user based on a preset scanner, generating a recognition area to be captured according to the facial image data, dynamically recognizing the recognition area by using the scanner, and acquiring the content to be recognized of the recognition area, before step S1, including:
s101: capturing decibel data in an environment;
s102: judging whether the decibel data is larger than a preset reading threshold value or not;
s103: if yes, stopping capturing the decibel data in the environment, starting a preset scanner, and acquiring the image data in the preset range based on the temperature information in the preset range.
In this embodiment, the intelligent system captures the decibel data in the current environment and judges whether it is greater than the preset readable threshold, so as to execute the corresponding step. For example, if the system captures 70 dB in the current environment and the preset readable threshold is 120 dB, the decibel data is not greater than the threshold, so the system does not enable the lip-reading function but still applies the scanner to scan the user while attempting to read the voice data input by the user. If the system captures 150 dB and the preset readable threshold is 120 dB, the decibel data is greater than the threshold, so the system stops capturing the decibel data in the environment, starts the preset scanner function to scan the temperature information in the preset range, and collects the image data in that range to read the face information in the image data.
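A corresponding sketch of the decibel gate (Python assumed; 120 dB, 70 dB and 150 dB are the example values used in this embodiment, and the function name is a placeholder):

```python
def should_switch_to_lip_reading(ambient_db: float, readable_threshold: float = 120.0) -> bool:
    """Stop audio capture and start the scanner when the environment is too loud."""
    return ambient_db > readable_threshold

assert not should_switch_to_lip_reading(70)   # 70 dB: keep trying to read the spoken voice input
assert should_switch_to_lip_reading(150)      # 150 dB: stop audio capture, start the scanner
```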
Referring to fig. 2, a data processing system based on a smart watch identification lip language according to an embodiment of the present invention includes:
the acquisition module 10 is configured to acquire facial image data of a user based on a preset scanner, generate a recognition area to be captured according to the facial image data, and dynamically recognize the recognition area by using the scanner to obtain content to be recognized of the recognition area;
the prediction module 20 is configured to input location information in the recognition area into a training model for training to obtain a trained prediction model, analyze the content to be recognized to obtain a transformation track of the recognition area, perform data synthesis on the transformation track based on dynamic transformation, generate lip data of the user, input the lip data into the prediction model for prediction, and obtain voice data of the user by prediction, where the transformation track is specifically a transformation track of a mouth motion of the user during speaking, and the location information includes an upper lip, a lower lip, a lip angle and a lip valley;
the capturing module 30 is configured to extract the voice data by using a preset database, and capture language features appearing in the voice data, where the language features include a sound production, a language and a language branch dialect;
The judging module 40 is configured to parse the lip language data based on the voice data and the language features, generate lip language processing data of the user, and judge whether the lip language processing data is clear according to a preset sentence scoring table;
and the execution module 50 is configured to, if the lip language processing data is clear, perform text translation on the lip language processing data, translate the lip language processing data based on a preset mapping relation, generate at least one or more language texts corresponding to the lip language processing data, and present the language texts on a preset display screen, wherein the language texts are specifically language type texts arranged based on a preset priority.
In this embodiment, the acquisition module 10 collects facial image data of the user wearing the smart watch based on a preset scanner, generates the recognition area to be captured, namely the mouth area, according to the facial image data, and dynamically recognizes the mouth area by using the scanner to obtain the voice content to be recognized of the mouth area. The scenario can be summarized as follows: because the user is in a noisy environment, the smart watch cannot record the user's voice content by capturing volume, so the user's mouth area is scanned to obtain the voice content to be recognized. The prediction module 20 then takes the corresponding position information in the recognition area (including the upper lip position, the lower lip position, the lip angle position and the lip valley position of the mouth area) as training samples, inputs the training samples into a blank training model for training to obtain a trained prediction model, analyzes the voice content to be recognized to obtain the lip transformation track of the user corresponding to that content, performs data synthesis on the transformation track based on the dynamic transformation process to generate the lip language data corresponding to the voice content to be recognized, and applies the prediction model to predict the lip language data to obtain the voice data corresponding to the user. The capturing module 30 applies a preset large database to extract features of the user's voice data and captures the language features appearing in the voice data, including the sound production features, the language types and the branch dialects corresponding to the language types. The judging module 40 parses the user's lip language data based on the voice data and the corresponding language features to generate the lip language processing data of the user, and then judges whether the lip language processing data is clear according to a preset sentence scoring table so as to execute the corresponding step. For example, if the system judges from the sentence scoring table that the lip language processing data is not clear, that is, the meaning of the user's voice content to be recognized is ambiguous, the smart watch displays the preset prompt "please repeat the voice content to be read" on the watch display screen to remind the user to repeat the content whose meaning was unclear, and the system performs reading recognition again. If the execution module 50 determines from the sentence scoring table that the lip language processing data is clear, that is, the voice content to be recognized has a corresponding meaning, the smart watch translates that meaning, translates the lip language processing data based on a preset mapping relation, generates one or more language texts corresponding to the translated text (for example, for a Chinese translation, language text branches such as Mandarin, Cantonese (Baihua) and Hakka; for a Spanish translation, branches such as Catalan, Basque and Galician), and finally presents these language texts on the display screen of the smart watch.
In this embodiment, the acquisition module further includes:
the capturing unit is used for capturing the current scannable width and the scanning distance of the user;
the judging unit is used for judging whether the scannable width and the scanning distance are in a preset scanning range or not;
the execution unit is used for, if the scannable width and the scanning distance are within the preset scanning range, acquiring the face information in the scanning range, comparing the face information with a preset scanning template in a difference way, generating an identifiable region of the face information, acquiring identity data corresponding to the face information according to the identifiable region, and acquiring lip fluctuation information in the identifiable region; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
and the recording unit is used for recording the starting time and the ending time of the lip fluctuation information if the lip fluctuation information exists in the face information.
In this embodiment, the intelligent system captures the width of the user and the scanning distance in the scanner frame against the set maximum scannable width and maximum scanning distance of the scanner, and judges whether they fall within the preset maximum width and maximum scanning distance so as to execute the corresponding step. For example, if the width of the user and the scanning distance in the scanner frame are not within the preset maximum width and maximum scanning distance, the smart watch does not meet the scanning condition, the user cannot be confirmed to be in the scannable range, and the scanning function is not started to scan the user. If the width of the user and the scanning distance are within the preset maximum width and maximum scanning distance, the smart watch collects the face information in the scanning range, performs a difference comparison between the face information and the preset scanning template to generate the identifiable area of the face information (namely the data of the five sense organs), obtains the identity data corresponding to the user according to the face information data of the identifiable area, collects the lip fluctuation information of the user, and records the start time and the end time of the lip fluctuation information.
It should be noted that the reason for the difference comparison between the face information and the scanning template is as follows: if the user suddenly leaves the range of the scanner during scanning, the comparison prevents the scanner from erroneously treating other information with a similar temperature as the face information.
In this embodiment, the prediction module further includes:
the acquisition unit is used for acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
the generating unit is used for determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as cluster numbers, and clustering the image data by utilizing the clustering algorithm to generate a clustering result, wherein the clustering result comprises the cluster label assigned to each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means clustering and spectral clustering;
the comparison unit is used for determining image data whose assigned cluster label in the clustering result is inconsistent with its corresponding annotation label as an abnormal training sample;
the training unit is used for deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
In this embodiment, the prediction module further includes:
the construction unit is used for constructing each reference axis taking the identification area as a reference coordinate system and taking the position information as each coordinate of an independent point based on the dynamic scene recorded in the content to be identified;
the extraction unit is used for extracting each reference axis and each coordinate as an image to be converted, mapping the image to be converted into a preset space template according to a preset conversion proportion, generating a converted image corresponding to the image to be converted, and acquiring a dynamic conversion track of the converted image by applying a preset frame rate based on view angle information preset by the space template and the dynamic scene.
In this embodiment, the intelligent system constructs three reference axes X, Y and Z with the area to be recognized (i.e., the mouth area) as the reference coordinate system, based on the dynamic scene data of the user recorded in the voice content to be recognized, and takes the four coordinates of the upper lip, the lower lip, the lip angle and the lip valley of the part information as independent points. A reference plane belonging to the mouth area is established from the reference axes and the coordinates; by extracting the three reference axes and the four coordinates in the reference plane, the reference plane is taken as the image to be converted and mapped into a preset space template according to a preset conversion proportion, generating a converted image that has the same content as the image to be converted but a different proportional size. The system then applies a preset frame rate to acquire the dynamic transformation track of the converted image based on the preset viewing angle information of the space template and the dynamic scene data, and finally obtains, through the dynamic transformation track, the data from the start of the user's lip transformation to the end of the transformation.
In this embodiment, the capturing module further includes:
a second extraction unit configured to extract period data of at least one or more predetermined time points in the voice data;
the encoding unit is used for encoding the period data by applying a preset encoder, converting the period data into voice feature vectors corresponding to each preset time point, and generating voice features corresponding to the voice data based on the voice feature vectors, wherein the encoder characterizes the voice data as vectors by applying an embedding layer to each item in the period data set.
In this embodiment, the system extracts the period data of one or more predetermined time points in the predicted voice data and uses the embedding layer of the encoder model to convert the voice data of each predetermined time point into a voice input vector, so as to obtain a sequence of voice input vectors that facilitates the subsequent encoding. The sequence of voice input vectors is then passed through the converter of the encoder model to convert the voice data of each predetermined time point into the voice feature vector corresponding to that time point. It should be understood that, because the converter-based encoder model encodes the voice input vectors based on context, the obtained voice feature vectors capture the associated voice information of the plurality of predetermined time points globally, so as to generate the voice features corresponding to the voice data.
In this embodiment, the judging module further includes:
the conversion unit is used for converting the lip language processing data into text information in a preset format based on the language type;
the reading unit is used for inputting the text information into a preset reading model for reading, and generating sentence scores corresponding to the text information based on the text meanings of the text information;
the second judging unit is used for judging whether the sentence score is larger than a score standard preset in the sentence score table or not;
and the second execution unit is used for judging that the lip language processing data has corresponding meanings if yes.
In this embodiment, the intelligent system converts the lip language processing data into text information in a preset format based on the language type, inputs the text information into a preset reading model for reading to obtain the text meaning corresponding to the text information, generates a sentence score corresponding to the text information based on that meaning, and judges whether the sentence score is greater than the score benchmark preset in the sentence scoring table so as to execute the corresponding step. For example, if the sentence score generated by the system is 70 points and the preset score benchmark in the scoring table is 60 points, the system determines that the lip language processing data has a corresponding text meaning and displays that meaning. If the sentence score is 55 points and the preset benchmark is 60 points, the system determines that the text meaning of the lip language processing data is ambiguous and does not display it.
In this embodiment, further comprising:
the second capturing module is used for capturing decibel data in the environment;
the second judging module is used for judging whether the decibel data is larger than a preset reading threshold value or not;
and the second execution module is used for, if the decibel data is greater than the preset reading threshold, stopping capturing the decibel data in the environment, starting a preset scanner, and acquiring the image data in the preset range based on the temperature information in the preset range.
In this embodiment, the intelligent system captures the decibel data in the current environment and judges whether it is greater than the preset readable threshold, so as to execute the corresponding step. For example, if the system captures 70 dB in the current environment and the preset readable threshold is 120 dB, the decibel data is not greater than the threshold, so the system does not enable the lip-reading function but still applies the scanner to scan the user while attempting to read the voice data input by the user. If the system captures 150 dB and the preset readable threshold is 120 dB, the decibel data is greater than the threshold, so the system stops capturing the decibel data in the environment, starts the preset scanner function to scan the temperature information in the preset range, and collects the image data in that range to read the face information in the image data.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. The data processing method based on the intelligent watch identification lip language is characterized by comprising the following steps of:
acquiring facial image data of a user based on a preset scanner, generating an identification area to be captured according to the facial image data, dynamically identifying the identification area by using the scanner, and acquiring the content to be identified of the identification area, wherein the identification area is the mouth area of the user, and the content to be identified is specifically the voice content to be identified that is obtained by scanning the mouth area of the user;
inputting the position information in the identification area into a training model for training to obtain a trained prediction model, analyzing the content to be identified to obtain a transformation track of the identification area, carrying out data synthesis on the transformation track based on dynamic transformation to generate lip language data of the user, and inputting the lip language data into the prediction model for prediction to obtain voice data of the user, wherein the transformation track is specifically the transformation track of the user's mouth action during speaking, the position information comprises the upper lip, the lower lip, the lip corners and the lip valleys, the position information can be used as training samples of the training model, the training model is specifically a blank training model, and the prediction model can predict the lip language data corresponding to the content to be identified to obtain the voice data corresponding to the lip language data;
extracting the voice data by using a preset database, and capturing language features appearing in the voice data, wherein the language features comprise sound production, languages and language branch dialects;
analyzing the lip language data based on the voice data and the language features, generating lip language processing data of the user, and judging whether the lip language processing data is clear or not according to a preset sentence scoring table;
if yes, carrying out text translation on the lip language processing data: translating the lip language processing data based on a preset mapping relation, generating at least one or more language texts corresponding to the lip language processing data, and displaying the language texts on a preset display screen, wherein the language texts are specifically language type texts arranged based on preset priorities.
2. The method for processing data based on the smart watch lip language according to claim 1, wherein the step of acquiring the face image data of the user based on the preset scanner, generating the identification area to be captured according to the face image data, dynamically identifying the identification area by using the scanner, and acquiring the content to be identified of the identification area comprises the following steps:
capturing the current scannable width and the scanning distance of the user;
judging whether the scannable width and the scanning distance are within a preset scanning range or not;
if yes, acquiring the face information in the scanning range, performing a difference comparison between the face information and a preset scanning template, generating an identifiable region of the face information, acquiring identity data corresponding to the face information according to the identifiable region, and acquiring lip fluctuation information in the identifiable region; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
if the lip fluctuation information exists in the face information, recording the starting time and the ending time of the lip fluctuation information.
3. The data processing method based on the intelligent watch identification lip language according to claim 1, wherein the step of inputting the position information in the identification area into a training model for training to obtain a trained prediction model comprises the following steps:
acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as the cluster numbers, and clustering the image data by using the clustering algorithm to generate a clustering result, wherein the clustering result comprises the corresponding label of each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means or spectral clustering;
determining the image data whose label in the clustering result is inconsistent with its corresponding label as abnormal training samples;
deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
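For illustration, a minimal sketch of this cleaning step follows. It assumes the position-information images have been flattened into numeric feature vectors and that the corresponding labels are integer class ids; K-means is one of the clustering choices named above, while the majority-label rule used to flag abnormal samples is an assumption rather than claim language.

import numpy as np
from sklearn.cluster import KMeans

def clean_training_set(features: np.ndarray, labels: np.ndarray):
    # Use the number of distinct corresponding labels as the cluster count.
    n_clusters = len(np.unique(labels))
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    keep = np.ones(len(labels), dtype=bool)
    for c in range(n_clusters):
        members = clusters == c
        majority = np.bincount(labels[members]).argmax()   # cluster's dominant label
        keep[members] &= labels[members] == majority       # mismatches are abnormal samples
    return features[keep], labels[keep]                    # the cleaned training sample set

# Usage with random stand-in data: 100 samples, 8 features, 4 label classes.
X = np.random.rand(100, 8)
y = np.random.randint(0, 4, size=100)
X_clean, y_clean = clean_training_set(X, y)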
4. The method for processing data based on smart watch lip language identification according to claim 1, wherein the step of analyzing the content to be identified to obtain a transformation track of the identification area and synthesizing the transformation track based on dynamic transformation comprises:
based on the dynamic scene recorded in the content to be identified, constructing reference axes that take the identification area as the reference coordinate system, and taking the position information as the coordinates of independent points;
and extracting each reference axis and each coordinate to serve as an image to be converted, mapping the image to be converted into a preset space template according to a preset conversion proportion, generating a converted image corresponding to the image to be converted, and acquiring a dynamic conversion track of the converted image by applying a preset frame rate based on the view angle information preset by the space template and the dynamic scene.
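For illustration only, a minimal sketch of the coordinate mapping in this step is given below. It assumes two-dimensional landmark coordinates for the position information, an assumed conversion proportion, and simple per-frame sampling of the transformation track; the preset view angle and frame-rate handling are omitted.

import numpy as np

TEMPLATE_SCALE = 0.5    # preset conversion proportion (assumed value)

def map_to_template(landmarks: np.ndarray, area_origin: np.ndarray) -> np.ndarray:
    # Express the landmarks in the identification-area reference frame,
    # then scale them into the preset space template.
    return (landmarks - area_origin) * TEMPLATE_SCALE

def transformation_track(frames, area_origin: np.ndarray) -> np.ndarray:
    # Mapping the landmarks of every captured frame and stacking the results
    # yields the dynamic transformation track of the converted images.
    return np.stack([map_to_template(f, area_origin) for f in frames])

# Usage: two frames of four lip landmarks (upper lip, lower lip, two lip corners).
origin = np.array([120.0, 80.0])
frames = [origin + np.random.rand(4, 2) * 40 for _ in range(2)]
track = transformation_track(frames, origin)    # shape (2, 4, 2)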
5. The smart watch-based lip-language recognition data processing method according to claim 1, wherein the step of extracting the voice data by using a preset database and capturing language features appearing in the voice data comprises:
extracting period data of at least one or more preset time points in the voice data;
and coding the period data by using a preset encoder, converting the period data into voice feature vectors corresponding to each preset time point, and generating voice features corresponding to the voice data based on the voice feature vectors, wherein the encoder specifically represents the voice data of each preset time point as a vector through an embedded layer and encodes each vector in the period data set.
6. The method for processing data based on intelligent watch lip language recognition according to claim 1, wherein the step of analyzing the lip language data based on the voice data and the language features to generate the lip language processing data of the user and judging whether the lip language processing data is clear according to a preset sentence scoring table comprises the steps of:
converting the lip language processing data into text information in a preset format based on the language type;
inputting the text information into a preset reading model for reading, and generating sentence scores corresponding to the text information based on the text meanings of the text information;
judging whether the sentence score is larger than a score benchmark preset in the sentence score table;
if yes, judging that the lip language processing data has corresponding meaning.
7. The method for processing the data based on the smart watch lip language according to claim 1, wherein the step of acquiring the face image data of the user based on the preset scanner, generating the identification area to be captured according to the face image data, dynamically identifying the identification area by using the scanner, and acquiring the content to be identified of the identification area comprises the following steps:
capturing decibel data in an environment;
judging whether the decibel data is larger than a preset reading threshold value or not;
if yes, stopping capturing the decibel data in the environment, starting a preset scanner, and acquiring the image data in the preset range based on the temperature information in the preset range.
8. The data processing system based on the intelligent watch identification lip language is characterized by comprising:
the acquisition module is used for acquiring facial image data of a user based on a preset scanner, generating an identification area to be captured according to the facial image data, dynamically identifying the identification area by using the scanner, and acquiring the content to be identified of the identification area, wherein the identification area is the mouth area of the user, and the content to be identified is specifically the voice content to be identified that is obtained by scanning the mouth area of the user;
the prediction module is used for inputting the position information in the identification area into a training model for training to obtain a trained prediction model, analyzing the content to be identified to obtain a transformation track of the identification area, carrying out data synthesis on the transformation track based on dynamic transformation to generate lip language data of the user, inputting the lip language data into the prediction model for prediction to obtain voice data of the user, wherein the transformation track is specifically a transformation track of mouth action when the user speaks, the position information comprises an upper lip, a lower lip, lip angles and lip valleys, the position information can be used as a training sample of the training model, the training model is specifically a blank training model, and the prediction model can predict the lip language data corresponding to the content to be identified to obtain voice data corresponding to the lip language data;
the capturing module is used for extracting the voice data by applying a preset database and capturing language features appearing in the voice data, wherein the language features comprise sound production, languages and language branch dialects;
the judging module is used for analyzing the lip language data based on the voice data and the language features, generating lip language processing data of the user, and judging whether the lip language processing data is clear or not according to a preset sentence scoring table;
and the execution module is used for, if so, carrying out text translation on the lip language processing data, translating the lip language processing data based on a preset mapping relation, generating at least one or more language texts corresponding to the lip language processing data, and displaying the language texts on a preset display screen, wherein the language texts are specifically language type texts arranged based on a preset priority.
9. The smart watch-based lip language identification data processing system of claim 8, wherein the acquisition module further comprises:
the capturing unit is used for capturing the current scannable width and the scanning distance of the user;
the judging unit is used for judging whether the scannable width and the scanning distance are in a preset scanning range or not;
the execution unit is used for, if so, acquiring the face information in the scanning range, performing a difference comparison between the face information and a preset scanning template, generating an identifiable region of the face information, acquiring identity data corresponding to the face information according to the identifiable region, and acquiring lip fluctuation information in the identifiable region; wherein the identifiable region includes an ocular feature, a nasal feature, a lip feature, a brow feature, and an ear feature;
and the recording unit is used for recording the starting time and the ending time of the lip fluctuation information if the lip fluctuation information exists in the face information.
10. The smart watch-based lip-recognition data processing system of claim 8, wherein the prediction module further comprises:
the acquisition unit is used for acquiring an initial training sample set, wherein training samples in the initial training sample set comprise image data of the position information and corresponding labels of the image data;
the generating unit is used for determining a clustering algorithm corresponding to the initial training sample set, taking the corresponding labels as the cluster numbers, and clustering the image data by utilizing the clustering algorithm to generate a clustering result, wherein the clustering result comprises the corresponding label of each piece of image data, each cluster comprises a plurality of image data samples, and the clustering algorithm is selected from K-means or spectral clustering;
the comparison unit is used for determining the image data whose label in the clustering result is inconsistent with its corresponding label as abnormal training samples;
the training unit is used for deleting the abnormal training sample from the initial training sample set to obtain a cleaning training sample set, and inputting the cleaning training sample set into the training model for training.
CN202310425186.4A 2023-04-20 2023-04-20 Data processing method and system based on intelligent watch identification lip language Active CN116189271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310425186.4A CN116189271B (en) 2023-04-20 2023-04-20 Data processing method and system based on intelligent watch identification lip language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310425186.4A CN116189271B (en) 2023-04-20 2023-04-20 Data processing method and system based on intelligent watch identification lip language

Publications (2)

Publication Number Publication Date
CN116189271A CN116189271A (en) 2023-05-30
CN116189271B true CN116189271B (en) 2023-07-14

Family

ID=86438675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310425186.4A Active CN116189271B (en) 2023-04-20 2023-04-20 Data processing method and system based on intelligent watch identification lip language

Country Status (1)

Country Link
CN (1) CN116189271B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253051A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Lip language recognition method and apparatus

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012059017A (en) * 2010-09-09 2012-03-22 Kyushu Institute Of Technology Word-spotting lip reading device and method
CN105825167A (en) * 2016-01-29 2016-08-03 维沃移动通信有限公司 Method for enhancing lip language recognition rate and mobile terminal
CN112784696B (en) * 2020-12-31 2024-05-10 平安科技(深圳)有限公司 Lip language identification method, device, equipment and storage medium based on image identification
CN114581812B (en) * 2022-01-12 2023-03-21 北京云辰信通科技有限公司 Visual language identification method and device, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253051A1 (en) * 2019-06-18 2020-12-24 平安科技(深圳)有限公司 Lip language recognition method and apparatus

Also Published As

Publication number Publication date
CN116189271A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN110751208B (en) Criminal emotion recognition method for multi-mode feature fusion based on self-weight differential encoder
CN110909613B (en) Video character recognition method and device, storage medium and electronic equipment
US10878824B2 (en) Speech-to-text generation using video-speech matching from a primary speaker
CN104361276B (en) A kind of multi-modal biological characteristic identity identifying method and system
CN109714608B (en) Video data processing method, video data processing device, computer equipment and storage medium
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN113327621A (en) Model training method, user identification method, system, device and medium
CN112768070A (en) Mental health evaluation method and system based on dialogue communication
CN111554279A (en) Multi-mode man-machine interaction system based on Kinect
CN111402892A (en) Conference recording template generation method based on voice recognition
CN116246610A (en) Conference record generation method and system based on multi-mode identification
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN115083392A (en) Method, device, equipment and storage medium for acquiring customer service coping strategy
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN114239610A (en) Multi-language speech recognition and translation method and related system
CN116189271B (en) Data processing method and system based on intelligent watch identification lip language
CN113393841B (en) Training method, device, equipment and storage medium of voice recognition model
CN109817223A (en) Phoneme marking method and device based on audio fingerprints
CN113837907A (en) Man-machine interaction system and method for English teaching
CN112734604A (en) Device for providing multi-mode intelligent case report and record generation method thereof
CN111243597A (en) Chinese-English mixed speech recognition method
CN114283493A (en) Artificial intelligence-based identification system
CN114171000A (en) Audio recognition method based on acoustic model and language model
CN114078470A (en) Model processing method and device, and voice recognition method and device
CN111276146A (en) Teaching training system based on voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant