CN108875743B - Text recognition method and device

Info

Publication number: CN108875743B
Application number: CN201710337521.XA
Authority: CN
Inventors: 王凯; 毛仁歆
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2017-05-15
Filing date: 2017-05-15
Publication date: 2022-02-22
Anticipated expiration: 2037-05-15
Also published as: CN108875743A

Abstract

The embodiments of the application disclose a text recognition method and a text recognition device. The method comprises: segmenting a text to be recognized to generate segmented texts; splicing the segmented texts according to standard texts stored in a pre-established standard text library to generate a spliced text, wherein the standard text library stores at least a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text; determining a matching characterization value between the spliced text and each corresponding standard text; and selecting, according to the matching characterization value, the standard text matched with the spliced text to generate a recognition result. With this method, no sample data needs to be collected for training, so the training and optimization process is eliminated; at the same time, splicing and matching the segmented texts against the standard text library can improve the recognition accuracy of keywords and phrases in the text to be recognized.

Description

Text recognition method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text recognition method and apparatus.

Background

At present, with the development of information technology, semantic recognition of text is used in a growing number of application scenarios, such as intelligent question answering, intelligent customer service, and search engines. Semantic recognition technology has become a subject of intense research.

Existing semantic recognition technology uses machine learning or deep learning to recognize semantics. However, both machine learning and deep learning require a large amount of sample data for repeated training and optimization in order to improve the recognition accuracy of the recognition model, and this process is cumbersome. Moreover, when machine learning or deep learning is applied in an actual business scenario, the running time is usually long.

In other words, how to recognize a text to be recognized in a less cumbersome and more efficient manner is a problem that urgently needs to be solved.

Disclosure of Invention

The embodiments of the application provide a text recognition method and a text recognition device, which are used to solve the problems that existing text recognition based on machine learning or deep learning requires cumbersome training on large amounts of sample data and takes a long time to run.

The text recognition method provided by the embodiment of the application comprises the following steps:

segmenting a text to be recognized to generate a segmented text;

splicing the segmented texts according to standard texts stored in a pre-established standard text library to generate a spliced text, wherein the standard text library stores at least a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text;

determining a matching characterization value between the spliced text and the corresponding standard text;

and selecting a standard text matched with the spliced text according to the matching characterization value to generate a recognition result.

The embodiment of the application provides a text recognition device, including:

the text segmentation module is used for segmenting the text to be recognized to generate a segmented text;

the text splicing module is used for splicing the segmented texts according to standard texts stored in a pre-established standard text library to generate a spliced text, wherein the standard text library stores at least a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text;

the score determining module is used for determining a matching characterization value between the spliced text and the corresponding standard text;

and the result generation module is used for selecting the standard text matched with the spliced text according to the matching characterization value and generating a recognition result.

The embodiment of the application adopts at least one technical scheme which can achieve the following beneficial effects:

After the text to be recognized is received, it can be segmented to generate a plurality of segmented texts. These segmented texts may include incorrectly divided characters, words, or phrases, in which case the segmented texts can be spliced according to a pre-established standard text library. The standard text library of the application stores at least standard segmented texts and text sets composed of the standard texts that contain those standard segmented texts. Therefore, during text splicing, the matching characterization values between each spliced text and the different standard texts in the text set can be determined, so that an appropriate standard text can be selected, based on the matching characterization values, to match the text to be recognized, thereby realizing recognition of the text to be recognized.

Compared with prior-art recognition that depends on machine learning or deep learning, the text recognition method in the embodiments of the application does not need to collect sample data for training, so the training and optimization process is eliminated. At the same time, splicing and matching the segmented texts against the standard text library can improve the recognition accuracy of keywords and phrases in the text to be recognized and can normalize those phrases.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1a is a schematic diagram of an architecture on which the text recognition method provided by an embodiment of the present application is based;

FIG. 1b is a schematic diagram of a text processing procedure provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a text recognition apparatus provided by an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

As mentioned above, the existing methods of machine learning and deep learning for semantic recognition of text usually require a large amount of sample data for training, and the training process using the sample data often takes a long time. Meanwhile, machine learning and deep learning are time-consuming to operate in an actual service scene, and real-time or near-real-time recognition response is difficult to achieve under the condition of large processing amount.

Therefore, in the embodiment of the application, a text recognition method independent of machine learning or deep learning is provided, and the text to be recognized is recognized quickly and accurately.

In particular, the method may employ the architecture shown in FIG. 1a, which comprises users with service requirements and a server with a text recognition function.

The user can interact with the server through a client (such as an application client or a browser). The server can be a back-end server, with a text recognition function, of a service provider (such as a website, a telecommunications operator, a bank, or a data center). In practical applications, the server can run a corresponding service system, such as a search engine, an intelligent question-answering system, an intelligent customer service system, or an intelligent chat system, and provides services based on text recognition to the user by executing the text recognition method of the application. Of course, the architecture shown in FIG. 1a is only a simple architecture presented to facilitate understanding of the method in the embodiments of the present application; in practical applications, the servers may also be deployed as a cluster and provide text-recognition-based business services to a large number of users. This should not be construed as limiting the application.

It should be noted that the text recognition method in the embodiments of the present application is applicable to recognition scenarios in different languages. The following description mainly addresses Chinese text; for texts in other languages, reference may be made to the processing procedure described for Chinese text.

In addition, in the embodiments of the present application, the characters, words, phrases, short sentences, long sentences, and/or combinations thereof input by the user are referred to as the "text to be recognized". In one mode, the text to be recognized is typed by the user through the corresponding client; in another mode, the text to be recognized is obtained by converting the user's voice input into text. This should not be construed as limiting the application.

Based on the architecture shown in FIG. 1a, an embodiment of the present application provides a text processing procedure which, as shown in FIG. 1b, specifically includes the following steps:

Step S101: segment the text to be recognized to generate segmented texts.

In the embodiments of the application, the text to be recognized can be segmented by a word segmentation tool (or algorithm). It can be understood that segmenting the text to be recognized yields a plurality of segmented texts (in this step, "segmented texts" means at least two segmented texts). The segmented texts may include words, phrases, and/or single characters obtained by the segmentation.

For example, for the text to be recognized "last year's sales amount is very low", the segmented texts may include "last year", "sales amount", "very", and "low".

It should be noted that, in practical applications, the text to be recognized may contain irregular or colloquial wording. Existing segmentation methods generally segment the text based on their own lexicon, and that lexicon may not contain certain special words or phrases. As a result, the segmented texts obtained after segmentation may include incorrectly divided characters, words, or phrases. The following steps correct and optimize the segmented texts so as to complete the recognition of the text to be recognized.

Step S102: splice the segmented texts according to the standard texts stored in a pre-established standard text library to generate a spliced text.

In the embodiments of the present application, the texts stored in the standard text library include at least standard texts such as standard words and phrases that conform to natural-language grammar, and special words and special phrases related to the business.

As an implementation manner in the embodiments of the present application, the standard text library may be further divided into an inverted index library and a standard word segmentation library. The inverted index library stores a standard text set corresponding to part or all of the segmented texts; the standard text set comprises at least one standard text, and each standard text in the set contains the corresponding segmented text. The standard word segmentation library stores possible splicing combinations of different segmented texts. In actual operation, the standard texts can be determined and acquired by technologies such as data mining and text analysis, optionally combined with manual entry by an operator; this does not constitute a limitation of the present application.

The inverted index library, the standard word segmentation library, and the splicing process are described in detail later and are not elaborated here.

In the embodiment of the present application, the segmented texts after the splicing process are collectively referred to as a spliced text, regardless of whether the segmented texts are spliced into a new text combination.

After the splicing processing, the matching degree of some obtained spliced texts and the standard texts is higher, and the matching degree of some spliced texts and the standard texts is lower relatively. In order to determine the appropriate concatenated text, the following steps S103 and S104 are performed.

Step S103: determine the matching characterization values between the spliced text and all of the corresponding standard texts.

As mentioned above, a segmented text may correspond to a plurality of standard texts, and after splicing the spliced text may still correspond to a plurality of standard texts, but the degree to which the spliced text matches each of them differs. To quantify this difference, in the embodiments of the present application, the matching characterization values between the spliced text and all of the corresponding standard texts are determined.

It should be noted that, as a feasible manner in the embodiments of the present application, the matching characterization value may be calculated based on the edit distance. Specifically:

matching characterization value = 1 - edit distance / max(length(spliced text), length(standard text))

where length denotes the character length of a text, and the quotient

edit distance / max(length(spliced text), length(standard text))

is the difference characterization value between the spliced text and the standard text (i.e., the degree of difference between them).

Of course, no limitation to the present application should be construed thereby.
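The calculation above can be illustrated with a minimal Python sketch. It assumes that "edit distance" means the Levenshtein distance (insertions, deletions, and substitutions each cost 1), which the text does not state explicitly; the function names `edit_distance`, `difference_value`, and `matching_value` are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed definition of 'edit distance')."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def difference_value(spliced: str, standard: str) -> float:
    """edit distance / max(length(spliced text), length(standard text))."""
    longest = max(len(spliced), len(standard))
    if longest == 0:  # two empty texts are treated as identical (an assumption)
        return 0.0
    return edit_distance(spliced, standard) / longest


def matching_value(spliced: str, standard: str) -> float:
    """1 - difference value; 1.0 means the two texts are identical."""
    return 1.0 - difference_value(spliced, standard)
```

With these definitions, identical texts give a matching value of 1.0, and two texts that share no characters give 0.0.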

Step S104: select the standard text matched with the spliced text according to the matching characterization value to generate a recognition result.

As a feasible way in the embodiments of the present application, a matching threshold may be set, and a standard text whose matching characterization value is not less than the matching threshold is selected as the standard text matched with the spliced text. In other words, the selected standard text has substantially the same semantic meaning as the corresponding spliced text. Further, the spliced text can be converted into the standard text (in the embodiments of the present application, converting the spliced text into the standard text is referred to as normalization), which completes the recognition of the text to be recognized and generates the recognition result.
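A minimal sketch of this threshold-based selection follows. The text only requires that the matching characterization value be not less than the threshold; returning the single highest-scoring candidate among those that qualify is an added assumption, and the function name `select_standard_text` and the threshold values used in the checks are hypothetical.

```python
from typing import Optional


def select_standard_text(scores: dict[str, float], threshold: float) -> Optional[str]:
    """Pick the standard text to normalize a spliced text to.

    scores maps each candidate standard text to its matching
    characterization value for the spliced text. Candidates below the
    threshold are discarded; among the rest the best one is returned
    (choosing the best is an assumption, not stated in the text).
    """
    qualified = {text: value for text, value in scores.items() if value >= threshold}
    if not qualified:
        return None  # no standard text matches well enough
    return max(qualified, key=qualified.get)
```

A return value of `None` would mean that the spliced text is left without a normalized form.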

It should be understood that the generated recognition result can be called by other business systems, which provide the corresponding business services based on the call. This is not described in further detail here.

Through the above steps, after the text to be recognized is received, it can be segmented to generate a plurality of segmented texts. These segmented texts may include incorrectly divided characters, words, or phrases, in which case the segmented texts can be spliced according to a pre-established standard text library. The standard text library of the application stores at least standard segmented texts and text sets composed of the standard texts that contain those standard segmented texts. Therefore, during text splicing, the matching characterization values between each spliced text and the different standard texts in the text set can be determined, so that an appropriate standard text can be selected, based on the matching characterization values, to match the text to be recognized, thereby realizing recognition of the text to be recognized.

In order to clearly illustrate the text recognition method in the embodiments of the present application, the following description will be made in detail with reference to examples.

Specifically, the method comprises the following steps:

I. Inverted index library

The inverted index library in the embodiments of the present application may be a database, such as MySQL or HBase, or may be a file, such as a txt file or an Excel file; this does not constitute a limitation of the present application.

In practical applications, the inverted index library may be generated from a basic text library. The basic text library may be a database storing a large amount of natural-language text and business text; as mentioned above, the business text may be the service provider's own business-specific words, business-specific phrases, and the like. The data in the inverted index library may originate from the basic text library. It is not necessary to create a separate database as the basic text library: in practical applications, the back-end database in which the service provider stores standard business text and the database in which natural-language text is stored may be used directly as the basic text library. Of course, such embodiments do not constitute a limitation of the present application.

As a feasible way in practice, the inverted index library may take the form of a data table; Table 1 below shows an inverted index library in this form:

TABLE 1

As can be seen from Table 1, the standard segmented texts of the inverted index library may include words, phrases, and/or sentences (Table 1 uses words only as an example). Each standard segmented text stored in the inverted index library has a unique text label. Note that the standard segmented texts shown in Table 1 may be obtained in advance by segmenting the standard texts with a segmentation tool (the same segmentation tool that is used to segment the text to be recognized, as described above).

In the inverted index library, each entry in a standard text set is expressed in the format "text, code, type". It should be understood that this format is only an example, and different formats may be adopted in practical applications.

Specifically, the "text" in this format is the standard text described in the embodiments of the present application, and the standard text contains or partially contains the corresponding standard segmented text. As shown in Table 1, the text "e-commerce refund amount" contains the standard segmented text "e-commerce", and the text "per capita consumption" contains the standard segmented text "consumption".

The "code" is the code corresponding to the text in the basic text library; that is, the text content or related information corresponding to the code can be found in the basic text library by means of the code. For example, the text "e-commerce refund amount" can be found in the basic text library according to the code KPI0001.

The "type" is the text type corresponding to the standard text. In the embodiments of the present application, the text types may include: business object name, attribute name, and attribute value.

A business object can be regarded as an object involved in the various business services of a service provider, including but not limited to business products, users, and various business indexes. Accordingly, a business object name is the specific name of a business object, such as a business product name, a user name, or a business index name.

An attribute name can be regarded as the name of a business attribute of a business object. For example, for the business object "store", the attribute names may include store address, per capita consumption, and so on. As another example, for the business object "user", the attribute names may include gender, age, and so on.

An attribute value can be regarded as a value corresponding to an attribute name. For example, the values of "gender" include male and female.

In summary, for example, in the text "per capita consumption of the KFC Hangzhou Wensan store", the text type of "KFC Hangzhou Wensan store" is business object name (i.e., a store name), and the text type of "per capita consumption" is attribute name.
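The "text, code, type" entry format described above can be modeled as a small record type. The sketch below is one possible in-memory representation, not the storage layout of the application; the class names, the code `ATTR0001`, and the type assigned to "e-commerce refund amount" are assumptions made for illustration (only the code KPI0001 comes from the description).

```python
from dataclasses import dataclass
from enum import Enum


class TextType(Enum):
    BUSINESS_OBJECT_NAME = "business object name"
    ATTRIBUTE_NAME = "attribute name"
    ATTRIBUTE_VALUE = "attribute value"


@dataclass(frozen=True)
class StandardText:
    text: str        # the standard text itself
    code: str        # key for looking the text up in the basic text library
    type: TextType   # text type of the standard text


# Inverted index: standard segmented text -> standard text set.
inverted_index: dict[str, list[StandardText]] = {
    # KPI0001 is the code given in the description; treating a business
    # index as a business object name is an assumption.
    "e-commerce": [
        StandardText("e-commerce refund amount", "KPI0001",
                     TextType.BUSINESS_OBJECT_NAME),
    ],
    # ATTR0001 is a hypothetical code used only for this sketch.
    "female": [
        StandardText("female", "ATTR0001", TextType.ATTRIBUTE_VALUE),
    ],
}
```

Looking up `inverted_index["e-commerce"]` then returns every standard text that contains the segmented text "e-commerce", together with its code and type.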

II. Standard word segmentation library

The standard word segmentation library may also take the form of a relational database or table. The content of the standard word segmentation library can be regarded as word combinations obtained by N-gram segmentation. In practical applications, the value of N may be set as needed and is not specifically limited here.

It should be noted that the content stored in the standard word segmentation library is generally the word segmentation result of standard texts.

As a feasible manner in the embodiments of the present application, the standard word segmentation library may be as shown in Table 2 below (Table 2 is a 2-gram library):

Text label	Text	Word segmentation mode
1	e-commerce consumption	2-gram
2	refund amount	2-gram
3	consumption amount	2-gram
4	per capita consumption	2-gram

TABLE 2

The texts in Table 2 are generally derived from the aforementioned standard texts; entries such as "refund amount" and "per capita consumption" represent the normative segmentation of the standard texts. In practical applications, the text to be recognized may contain a standard text, but after processing by the segmentation tool it is broken into segmented texts such as "refund", "amount", "per capita", and "consumption". In the splicing process, the standard word segmentation library shown in Table 2 can therefore provide a reference for splicing the segmented texts and can serve as a judgment condition in the splicing process (described in the following contents).

The word segmentation mode in Table 2 is 2-gram segmentation; in practical applications, other N-gram modes, such as 3-gram and 4-gram, may also be used.

Of course, Table 2 is only one example of a standard word segmentation library and should not be construed as limiting the present application.
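As an illustration of how entries such as those in Table 2 could be derived, the following sketch lists the N-gram combinations of adjacent standard segmented texts. The function name `ngram_combinations` is hypothetical, and English words stand in for the original segmented texts.

```python
def ngram_combinations(segments: list[str], n: int = 2) -> list[tuple[str, ...]]:
    """Return every run of n adjacent segmented texts, in order."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]


# The standard text "e-commerce consumption amount", already segmented:
segments = ["e-commerce", "consumption", "amount"]
print(ngram_combinations(segments, n=2))
# [('e-commerce', 'consumption'), ('consumption', 'amount')]
```

The two resulting pairs correspond to rows 1 and 3 of Table 2.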

As can be seen from the above, the process of pre-establishing the standard text library may include:

acquiring standard texts, and segmenting the standard texts to generate standard segmented texts;

taking each standard segmented text as an index, collecting the standard texts that contain or partially contain the index to form a standard text set, establishing a correspondence between the standard text set and the index, and establishing the inverted index library based on the correspondence;

and, according to the set word segmentation grammar, determining the text combinations among the standard segmented texts that conform to the word segmentation grammar, and establishing the standard word segmentation library based on those text combinations.
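The inverted-index part of this construction can be sketched as follows. The segmenter is passed in as a parameter because the description does not fix a particular tool; the example uses whitespace splitting on English stand-in texts purely as a toy segmenter, and the sketch covers only full containment of a segment, not the "partially contains" case.

```python
from collections import defaultdict
from typing import Callable


def build_inverted_index(standard_texts: list[str],
                         segment: Callable[[str], list[str]]) -> dict[str, list[str]]:
    """Map each standard segmented text to the standard texts containing it."""
    index: dict[str, list[str]] = defaultdict(list)
    for text in standard_texts:
        # dict.fromkeys removes repeated segments while keeping their order,
        # so a standard text is listed at most once per index.
        for seg in dict.fromkeys(segment(text)):
            index[seg].append(text)
    return dict(index)


standard_texts = [
    "e-commerce refund amount",
    "e-commerce consumption amount",
    "per capita consumption",
]
index = build_inverted_index(standard_texts, str.split)  # toy segmenter
print(index["e-commerce"])
# ['e-commerce refund amount', 'e-commerce consumption amount']
```

Each key of `index` plays the role of an index (standard segmented text), and each value is the corresponding standard text set.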

In the embodiments of the present application, splicing the segmented texts specifically comprises: selecting segmented texts for splicing according to the arrangement order of the segmented texts.

The arrangement order of the segmented texts is consistent with their order in the text to be recognized.

Selecting and splicing the segmented texts specifically comprises: selecting a starting segmented text and searching the inverted index library for an index corresponding to the starting segmented text; if the index is found, determining the standard text set corresponding to the index and splicing according to the standard texts in that set; if the index is not found, selecting the next segmented text as the starting segmented text and splicing according to the newly selected starting segmented text.
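The selection of a starting segmented text can be sketched as a scan over the segmented text sequence that stops at the first segment found in the inverted index library. The function name `next_start` and the in-memory dictionary standing in for the inverted index library are assumptions for illustration.

```python
from typing import Optional


def next_start(segments: list[str],
               inverted_index: dict[str, list[str]],
               begin: int = 0) -> Optional[int]:
    """Return the position of the first segment at or after `begin`
    that hits an index in the inverted index library, or None."""
    for pos in range(begin, len(segments)):
        if segments[pos] in inverted_index:
            return pos
    return None


segments = ["female", "user", "of", "e-commerce", "consumption"]
inverted_index = {
    "female": ["female"],
    "e-commerce": ["e-commerce refund amount", "e-commerce consumption amount"],
}
print(next_start(segments, inverted_index, begin=0))  # 0
print(next_start(segments, inverted_index, begin=1))  # 3
```

Segments that miss the index ("user" and "of" here) are simply skipped when a new starting segmented text is chosen.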

Splicing according to the standard texts in the standard text set specifically comprises: selecting the adjacent segmented text arranged after the starting segmented text and performing cumulative splicing to generate a spliced text; determining the difference characterization value between the spliced text and each standard text in the corresponding standard text set; recording the spliced text according to the difference characterization value; and determining whether splicing with the selected starting segmented text is finished. If so, the recorded spliced texts are screened and a starting segmented text is reselected; otherwise, the adjacent segmented text arranged after the spliced text is selected and cumulative splicing continues.

Determining the difference characterization value between the spliced text and a standard text in the corresponding standard text set specifically comprises: calculating the difference characterization value according to the edit distance between the spliced text and the standard text.

Recording the spliced text specifically comprises: recording the spliced text when it meets the set recording condition.

The set recording condition includes: the difference characterization value corresponding to the spliced text is smaller than both the current minimum difference characterization value and the set difference characterization threshold.

Determining whether splicing with the selected starting segmented text is finished specifically comprises: when the difference characterization value does not meet the set conditions, determining that splicing with the selected starting segmented text is finished; when the difference characterization value meets the set conditions, determining that splicing with the selected starting segmented text continues;

wherein the set conditions include:

the minimum difference characterization value has not increased continuously for a set number of times; and

the minimum difference characterization value corresponding to the current spliced text is not larger than the minimum difference characterization value corresponding to the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word segmentation library.
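The two set conditions can be read as a single predicate, sketched below under stated assumptions: the standard word segmentation library is represented as a set of 2-gram pairs; the first clause of the second condition is treated as satisfied when there is no previous spliced text to compare against; and "increased continuously" is counted as the number of consecutive increases at the end of the recorded values. The function and parameter names are hypothetical.

```python
from typing import Optional


def should_continue(min_diffs: list[float],
                    current_seg: str,
                    next_seg: Optional[str],
                    bigram_library: set[tuple[str, str]],
                    set_times: int = 2) -> bool:
    """Decide whether to keep splicing from the current starting segment.

    min_diffs holds the minimum difference characterization value of each
    spliced text produced so far for this starting segment, oldest first.
    """
    # Condition one: the minimum difference value has not increased
    # `set_times` times in a row (counted from the most recent round).
    run = 0
    for prev, curr in zip(reversed(min_diffs[:-1]), reversed(min_diffs[1:])):
        if curr > prev:
            run += 1
        else:
            break
    condition_one = run < set_times

    # Condition two: the current value is no worse than the previous one,
    # or the (current, next) pair is in the standard word segmentation library.
    not_worse = len(min_diffs) < 2 or min_diffs[-1] <= min_diffs[-2]
    in_library = next_seg is not None and (current_seg, next_seg) in bigram_library
    condition_two = not_worse or in_library

    return condition_one and condition_two
```

Both conditions must hold for splicing to continue; if either fails, a new starting segmented text is selected.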

With reference to the inverted index library and the standard word segmentation library described above, the recognition process is illustrated below.

For example, assume that the text to be recognized input by the user is "female users' e-commerce consumption amounts sorted by constellation". After the text to be recognized is segmented by an existing segmentation method, the following segmented texts can be obtained:

female / user / of / e-commerce / consumption / amount / by / constellation / sort

As can be seen, 9 segmented texts are obtained after text segmentation. Since the 9 segmented texts include incorrectly divided text, they are spliced. It should be noted that the segmented texts obtained after segmentation are arranged in order (i.e., they form a segmented text sequence whose order is consistent with the order of the text to be recognized), so splicing is also performed on the basis of this segmented text sequence.

First, the first segmented text in the segmented text sequence, "female", is selected as the starting segmented text, and the inverted index library is searched for an index corresponding to it. Assuming that Table 1 is used as the inverted index library, the segmented text hits the index "female" in Table 1. Further, the standard text set corresponding to the index "female" can be determined in Table 1; this set includes only one standard text: "female". The segmented text matches the standard text exactly, so the matching characterization value is 1.0 (and the difference characterization value is 0). At this point, the splicing result obtained in the current round of splicing can be recorded, for example the starting segmented text, the standard text, the matching characterization value, and the text type of the current round. This should not be construed as limiting the application.

It should be noted that, in the embodiments of the present application, splicing can continue from the above starting segmented text only if set splicing conditions are satisfied. The splicing conditions may include:

Condition one: the minimum difference characterization value has not increased continuously for the set number of times (for convenience of description, the set number of times is referred to as "thresA"). Generally, thresA takes a value of 2. Specifically, thresA is used as a condition for determining whether to continue splicing adjacent segmented texts, and the value of thresA also provides support for fuzzy matching.

Condition two: the minimum difference characterization value corresponding to the current spliced text is not larger than the minimum difference characterization value corresponding to the previous spliced text, or the text combination of the currently selected segmented text and the next segmented text exists in the standard word segmentation library.

In any round of splicing, splicing can continue from the previously selected starting segmented text only if both conditions are met at the same time; otherwise, a starting segmented text is reselected for splicing.

And then, aiming at the segmented text sequence, selecting the next segmented text and the initial segmented text for continuous splicing, namely, selecting the user to splice with the previously selected female to obtain a spliced text female user. At this time, the stitched text still corresponds to the standard text set, which is not exactly matched with the standard text "female", and the difference between the two is 0.5.

Obviously, since the difference representation value 0.5 of the current spliced text "female user" is greater than the difference representation value 0 of the previous spliced text "female", the above condition two is not satisfied, and thus, the starting segmented text will be reselected.

In this example, the segmented text "user" is selected as the new starting segmented text, and a matching index is looked up in the inverted index library; no index is hit, so the starting segmented text is reselected again.

The segmented text "of" is then selected as the new starting segmented text; likewise, no index is hit in the inverted index library, so the starting segmented text is reselected once more.

At this time, the segmented text "e-commerce" is selected as the new starting segmented text, and the index "e-commerce" is hit in the inverted index library. Further, the standard text set corresponding to the index "e-commerce" may be determined in table 1, where the standard text set includes two standard texts: "e-commerce refund amount" and "e-commerce consumption amount". The starting segmented text "e-commerce" can itself be used as a spliced text, and the difference representation values between it and the standard texts "e-commerce refund amount" and "e-commerce consumption amount" are calculated respectively. Both difference representation values are 4/6, namely 0.67. The minimum difference representation value corresponding to the current spliced text "e-commerce" is therefore 0.67. Similarly, the splicing result obtained in the current round of splicing can be recorded, and redundant description is omitted here.

The minimum difference representation value 0.67 of the current splicing result has not increased thresA consecutive times (condition one is met), and the combination of "e-commerce" and the following segmented text "consumption" is recorded in the standard word segmentation library, namely table 2 (condition two is met). Splicing therefore continues on the basis of "e-commerce" (the next round of splicing is entered).

The segmented text "consumption" arranged after "e-commerce" is selected for splicing to obtain a new spliced text "e-commerce consumption". The spliced text hits the index "e-commerce" in the inverted index library (note that, when the spliced text is composed of a plurality of segmented texts, it is looked up in the inverted index library based on its first segmented text), and the corresponding two standard texts are still "e-commerce refund amount" and "e-commerce consumption amount". The difference representation values are calculated as 4/6 (0.67) and 2/6 (0.33) respectively. Since the difference representation value 0.33 is smaller than the current minimum difference representation value 0.67, the current minimum is updated to 0.33 (that is, the standard text "e-commerce consumption amount" better matches the current spliced text "e-commerce consumption"). The splicing result of this round is recorded.
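The ratios 4/6 and 2/6 above are consistent with an edit distance divided by the length of the longer text. The following is a minimal sketch under that assumption; the normalization rule and the function names are illustrative and are not stated in this passage. Because the worked example counts characters of the original-language text, the stand-in strings below use one letter per character: "AB" for the two-character spliced text, "ABCD" for the four-character spliced text, and "ABCDEF" and "ABXYEF" for two six-character standard texts that share their first two and last two characters.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


def difference_value(spliced, standard):
    """Edit distance normalized by the longer length (assumed normalization)."""
    longest = max(len(spliced), len(standard))
    return edit_distance(spliced, standard) / longest if longest else 0.0


# Stand-in strings: one letter per character of the original-language text.
print(round(difference_value("AB", "ABCDEF"), 2))    # 4 edits over 6 characters
print(round(difference_value("ABCD", "ABCDEF"), 2))  # 2 edits over 6 characters
print(round(difference_value("ABCD", "ABXYEF"), 2))  # 4 edits over 6 characters
```

Under the same assumption, the matching representation value listed in the result tables would be one minus this difference representation value.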

The minimum difference representation value of the current splicing result has decreased from 0.67 to 0.33, so it has not increased thresA consecutive times (condition one is met), and the updated minimum difference representation value 0.33 is smaller than the minimum difference representation value 0.67 of the previous round (condition two is met). Splicing therefore continues on the basis of "e-commerce consumption" (the next round of splicing is entered).

The segmented text "amount" arranged after "consumption" is selected for cumulative splicing to obtain a new spliced text "e-commerce consumption amount". This spliced text is processed as described above, and the difference representation value between the spliced text "e-commerce consumption amount" and the standard text "e-commerce consumption amount" is determined to be 0.17. Accordingly, the current minimum difference representation value is updated from 0.33 to 0.17, and the splicing result is recorded.

Subsequent rounds proceed by analogy and are not described in detail here.

It should be noted that, as a possible implementation, a data table may be used to record the splicing results during the above splicing process.

For example: for the above splicing text "female", the splicing result can be shown in table 3 below.

Serial number	Spliced text	Standard text	Matching representation value
1	Female	Female	1.0

TABLE 3

The splicing results generated in the subsequent splicing process can be cumulatively recorded in table 3. For example: as shown in table 4 below.

Serial number	Spliced text	Standard text	Matching representation value
1	Female	Female	1.0
2	E-commerce	E-commerce consumption amount	0.33
3	E-commerce	E-commerce refund amount	0.33

TABLE 4

However, it should be noted that in the embodiment of the present application, not every obtained splicing result is cumulatively recorded in the table; the table is updated only when a splicing result satisfies a certain recording condition. Specifically, the recording condition may include:

The difference representation value corresponding to the spliced text is less than the current minimum difference representation value.

Of course, in a practical application scenario a large number of splicing results may satisfy this condition, which increases the number of records. To ensure processing efficiency, an additional condition may be added to the above recording condition, namely: the difference representation value corresponding to the spliced text < the set difference representation threshold (for convenience of description, the set difference representation threshold is referred to as "thresB"). Generally, thresB takes a value of 0.93.

Under the additional condition, a certain number of splicing results can be filtered out directly, which reduces meaningless operations. Of course, it should be understood that the higher the value of thresB, the fewer splicing results are directly filtered out (more splicing results may be recorded), which can increase the recognition accuracy to some extent but reduces the processing efficiency; conversely, the lower the value of thresB, the more splicing results are directly filtered out (fewer splicing results are recorded), which reduces the recognition accuracy to a certain degree but can improve the processing efficiency. That is, the value of thresB can be set according to the needs of the actual application (0.93 in this example is only one preferred value).
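The recording condition and its additional thresB filter, as literally stated above, can be sketched as follows. The function name and the use of None to mean that nothing has been recorded yet are assumptions made for illustration only.

```python
def should_record(diff, current_min, thres_b=0.93):
    """Return True if a splicing result should be written to the result table.

    diff: difference representation value of the new spliced text.
    current_min: current minimum difference representation value, or None
                 if no result has been recorded yet (assumption).
    thres_b: set difference representation threshold (0.93 in the example).
    """
    below_min = current_min is None or diff < current_min
    return below_min and diff < thres_b


# The same candidate is kept or dropped depending on thresB.
print(should_record(0.5, None, thres_b=0.93))
print(should_record(0.5, None, thres_b=0.4))
```

Lowering thres_b discards more candidates before any further processing, which is the efficiency and accuracy trade-off described above.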

Continuing with the present example, assuming that the splicing process is complete, the obtained splicing results are shown in table 5 below.

Serial number	Spliced text	Standard text	Matching representation value
1	Female	Female	1.0
2	Constellation	Constellation	1.0
3	E-commerce consumption amount	E-commerce consumption amount	0.83
4	E-commerce consumption	E-commerce consumption amount	0.67
5	E-commerce consumption	E-commerce refund amount	0.33
6	E-commerce	E-commerce consumption amount	0.33

TABLE 5

At this time, the splicing results in table 5 are screened according to a matching representation threshold (for convenience of description, the matching representation threshold is denoted as "thresC"). If the value of thresC is 0.6, each standard text with a matching representation value greater than 0.6 is selected and normalized.

At this time, the processed segmentation results are obtained:

the e-commerce consumption amounts of female users are sorted according to the constellation
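The screening step can be illustrated with the rows of table 5. This is a sketch only: the tuple layout, the function name, and the decision to keep one entry per standard text with its highest matching representation value are assumptions, since the passage does not spell out how duplicate standard texts are merged.

```python
THRES_C = 0.6  # matching representation threshold from the example

# Rows of table 5: (spliced text, standard text, matching representation value)
table_5 = [
    ("female", "female", 1.0),
    ("constellation", "constellation", 1.0),
    ("e-commerce consumption amount", "e-commerce consumption amount", 0.83),
    ("e-commerce consumption", "e-commerce consumption amount", 0.67),
    ("e-commerce consumption", "e-commerce refund amount", 0.33),
    ("e-commerce", "e-commerce consumption amount", 0.33),
]


def screen(results, thres_c=THRES_C):
    """Keep standard texts whose matching value exceeds thres_c.

    Assumption: when a standard text appears in several rows, only its
    best matching value is kept.
    """
    best = {}
    for _spliced, standard, score in results:
        if score > thres_c and score > best.get(standard, 0.0):
            best[standard] = score
    return best


for standard, score in screen(table_5).items():
    print(standard, score)
```

With thresC at 0.6, the rows for "e-commerce refund amount" and for the bare spliced text "e-commerce" fall below the threshold and are dropped.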

Of course, as one option in this embodiment of the present application, when generating the recognition result, the text type of each corresponding key text may be marked, for example:

{"name": "female", "type": "attribute value", "score": 1.0}

{"name": "constellation", "type": "attribute name", "score": 1.0}

{"name": "e-commerce consumption amount", "type": "business object name", "score": 0.83}
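Records of this shape can be produced with a standard JSON serializer. The sketch below assumes the three field names shown above; the helper name to_record is hypothetical.

```python
import json


def to_record(name, text_type, score):
    """Serialize one recognized key text as a JSON object string."""
    record = {"name": name, "type": text_type, "score": score}
    # ensure_ascii=False keeps non-ASCII names readable in the output.
    return json.dumps(record, ensure_ascii=False)


print(to_record("female", "attribute value", 1.0))
print(to_record("constellation", "attribute name", 1.0))
print(to_record("e-commerce consumption amount", "business object name", 0.83))
```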

Of course, the recognition result may take different forms in different embodiments; for example, Inverse Document Frequency (IDF) information and the like may be added, and this should not be construed as limiting the present application.

Based on the same idea as the text recognition method above, an embodiment of the present application further provides a text recognition apparatus, as shown in fig. 2. The apparatus includes:

the text segmentation module 201 is used for segmenting the text to be recognized to generate segmented texts;

the text splicing module 202 is used for splicing the segmented texts according to standard texts stored in a pre-established standard text base to generate spliced texts; the standard text library at least stores a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text;

the score determining module 203 is used for determining a matching representation value between the spliced text and the corresponding standard text;

and the result generation module 204 selects a standard text matched with the spliced text according to the matching representation value, and generates a recognition result.

The standard text library at least comprises an inverted index library and a standard word segmentation library. The apparatus further comprises a text library creation module 205, configured to obtain a standard text, perform segmentation processing on the standard text to generate standard segmented texts, use each standard segmented text as an index, collect the standard texts that contain or partially contain the index to form a standard text set, establish a correspondence between the standard text set and the index, create the inverted index library based on the correspondence, determine, according to a set word segmentation grammar, the text combinations in the standard segmented texts that conform to the word segmentation grammar, and create the standard word segmentation library based on the text combinations;

wherein the set word segmentation grammar at least comprises: bigram, trigram.
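The library creation described above can be sketched as follows. This is an illustration under stated assumptions: the standard texts are taken as already segmented (the segmentation shown for the two e-commerce texts is assumed), only exact containment of an index is modeled (not partial containment), and the function and variable names are hypothetical.

```python
from collections import defaultdict


def build_libraries(segmented_standard_texts):
    """Build an inverted index library and a standard word segmentation library.

    segmented_standard_texts: dict mapping each standard text to the list of
    its standard segmented texts, in order.
    """
    inverted_index = defaultdict(set)  # index -> standard text set
    combinations = set()               # bigram and trigram text combinations
    for standard_text, segments in segmented_standard_texts.items():
        for segment in segments:
            inverted_index[segment].add(standard_text)
        for n in (2, 3):               # bigram, trigram
            for i in range(len(segments) - n + 1):
                combinations.add(tuple(segments[i:i + n]))
    return dict(inverted_index), combinations


# Assumed segmentation of the standard texts used in the worked example.
standard = {
    "e-commerce refund amount": ["e-commerce", "refund", "amount"],
    "e-commerce consumption amount": ["e-commerce", "consumption", "amount"],
    "female": ["female"],
}
index, combos = build_libraries(standard)
print(sorted(index["e-commerce"]))
print(("e-commerce", "consumption") in combos)
```

Looking up the index "e-commerce" returns both e-commerce standard texts, and the bigram of "e-commerce" followed by "consumption" is present, which is what the splicing conditions consult.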

The text splicing module 202 selects the segmented texts to splice according to the arrangement sequence of the segmented texts.

The text splicing module 202 selects an initial segmented text, and searches an index corresponding to the initial segmented text in the inverted index database;

if the index corresponding to the initial segmentation text is found, determining a standard text set corresponding to the index, and splicing according to the standard text in the standard text set;

and if the index corresponding to the initial segmentation text is not found, selecting the next segmentation text as the initial segmentation text, and splicing according to the newly selected initial segmentation text.
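The selection of a starting segmented text can be sketched as a forward scan over the segmented text sequence. The function name, the list and dictionary representations, and the use of None for "no further hit" are assumptions for illustration; the segment order is taken from the worked example in the description.

```python
def find_start(segments, inverted_index, start=0):
    """Return the position of the first segmented text, at or after start,
    that hits an index in the inverted index library, or None if none does."""
    for position in range(start, len(segments)):
        if segments[position] in inverted_index:
            return position
    return None


segments = ["female", "user", "of", "e-commerce", "consumption", "amount"]
inverted_index = {
    "female": {"female"},
    "e-commerce": {"e-commerce refund amount", "e-commerce consumption amount"},
}

# After splicing from "female" ends, "user" and "of" miss; "e-commerce" hits.
print(find_start(segments, inverted_index, start=1))
```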

The text splicing module 202 selects adjacent segmented texts arranged after the initial segmented text, performs cumulative splicing to generate a spliced text, determines a difference characteristic value between the spliced text and a standard text in a corresponding standard text set, records the spliced text according to the difference characteristic value, and determines whether splicing with the selected initial segmented text is finished;

if so, screening the spliced text, and reselecting the initial segmentation text;

otherwise, selecting the adjacent segmented texts arranged behind the spliced text, and continuing to perform cumulative splicing.

The text splicing module 202 calculates the difference representation value according to the edit distance between the spliced text and the standard text in the corresponding standard text set.

The text splicing module 202 records the spliced text when the spliced text meets the set recording condition.

The text splicing module 202 determines to finish splicing the selected initial segmented text when the difference characteristic value does not satisfy the set condition, and determines to continue splicing the selected initial segmented text when the difference characteristic value satisfies the set condition.

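Wherein the set conditions include: the minimum difference representation value has not increased a set number of consecutive times, and the minimum difference representation value corresponding to the current spliced text is not larger than the minimum difference representation value corresponding to the previous spliced text, or a text combination of the currently selected segmented text and the next segmented text exists in the standard word segmentation library.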

The score determining module 203 calculates the matching representation value according to the edit distance between the spliced text and the standard text in the corresponding standard text set.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by lightly logic-programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. The means for performing the functions may even be regarded both as software modules for performing the method and as structures within the hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, in the form of Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and for relevant points, reference may be made to the corresponding description of the method embodiment.

The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A text recognition method, comprising:

segmenting a text to be recognized to generate a segmented text;

according to a standard text stored in a pre-established standard text base, splicing the segmented texts to generate spliced texts; wherein the standard text library at least comprises: an inverted index library and a standard word segmentation library; the standard text library at least stores a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text;

selecting a standard text matched with the spliced text according to the matching characterization value, and generating a recognition result;

splicing the segmented texts, specifically comprising:

selecting the segmented texts to splice according to the arrangement sequence of the segmented texts;

the arrangement sequence of the plurality of segmented texts is consistent with the arrangement sequence of texts in the texts to be recognized;

selecting and splicing the segmented texts, specifically comprising:

selecting a starting segmentation text;

searching an index corresponding to the initial segmentation text in the inverted index library;

if the index corresponding to the initial segmentation text is not found, selecting the next segmentation text as the initial segmentation text, and splicing according to the newly selected initial segmentation text;

splicing according to the standard texts in the standard text set, which specifically comprises the following steps:

selecting adjacent segmented texts arranged behind the initial segmented text, and performing cumulative splicing to generate spliced texts;

determining a difference representation value between the spliced text and a standard text in a corresponding standard text set;

recording the spliced text according to the difference representation value, and judging whether splicing is finished by the selected initial segmentation text;

2. The method of claim 1, wherein

pre-establishing the standard text library specifically comprises:

acquiring a standard text;

performing segmentation processing on the standard text to generate a standard segmented text;

taking each standard segmentation text as an index, counting the standard texts containing or partially containing the index to form a standard text set, establishing a corresponding relation between the standard text set and the index, and establishing the inverted index library based on the corresponding relation;

determining a text combination which accords with the word segmentation grammar in the standard segmentation text according to the set word segmentation grammar, and establishing the standard word segmentation library based on the text combination;

wherein the set word segmentation grammar at least comprises: bigram, trigram.

3. The method according to claim 1, wherein determining a difference representation value between the stitched text and the standard text in the corresponding standard text set specifically comprises:

and calculating the difference representation value according to the editing distance between the spliced text and the standard text in the corresponding standard text set.

4. The method according to claim 1, wherein the recording of the stitched text specifically comprises:

recording the spliced text when the spliced text meets the set recording condition;

5. The method of claim 1, wherein the step of determining whether to terminate the splicing with the selected start segmented text comprises:

when the difference representation value does not meet the set condition, determining to end splicing with the selected starting segmented text;

when the difference representation values meet set conditions, judging to continue splicing with the selected initial segmentation texts;

6. The method according to claim 1, wherein determining a matching token between the stitched text and the corresponding standard text specifically comprises:

and calculating the matching representation value according to the editing distance between the spliced text and the standard text in the corresponding standard text set.

7. A text recognition apparatus comprising:

the text splicing module is used for splicing the segmented texts according to standard texts stored in a pre-established standard text base to generate spliced texts; wherein the standard text library at least comprises: an inverted index library and a standard word segmentation library; the standard text library at least stores a standard text set corresponding to part or all of the segmented texts, and the standard text set comprises at least one standard text;

the result generation module is used for selecting a standard text matched with the spliced text according to the matching representation value and generating a recognition result;

the text splicing module selects and splices the segmented texts according to the arrangement sequence of the segmented texts;

the text splicing module selects an initial segmentation text and searches an index corresponding to the initial segmentation text in the inverted index library;

the text splicing module selects adjacent segmented texts arranged behind the initial segmented text, performs cumulative splicing to generate a spliced text, determines a difference representation value between the spliced text and a standard text in a corresponding standard text set, records the spliced text according to the difference representation value, and judges whether splicing with the selected initial segmented text is finished;

8. The apparatus as set forth in claim 7,

the apparatus further comprises: a text library creation module, configured to acquire a standard text, segment the standard text to generate standard segmented texts, take each standard segmented text as an index, collect the standard texts that contain or partially contain the index to form a standard text set, establish a correspondence between the standard text set and the index, establish the inverted index library based on the correspondence, determine, according to a set word segmentation grammar, the text combinations in the standard segmented texts that conform to the word segmentation grammar, and establish the standard word segmentation library based on the text combinations;

wherein the set word segmentation grammar at least comprises: bigram, trigram.

9. The apparatus of claim 7, wherein the text splicing module calculates the difference representation value based on an edit distance between the spliced text and the standard text in the corresponding standard text set.

10. The device of claim 7, wherein the text splicing module records the spliced text when the spliced text meets a set recording condition;

11. The apparatus of claim 7, wherein the text stitching module determines to end the stitching with the selected starting segmented text when the difference characteristic value does not satisfy a set condition, and determines to continue the stitching with the selected starting segmented text when the difference characteristic value satisfies the set condition;

12. The apparatus of claim 7, wherein the score determination module calculates the matching representation value based on an edit distance between the spliced text and the standard text in the corresponding standard text set.