CN114997138A - Chemical specification analysis method, device, equipment and readable storage medium - Google Patents

Chemical specification analysis method, device, equipment and readable storage medium

Info

Publication number
CN114997138A
Authority
CN
China
Prior art keywords
text
line
chapter
block
blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210699721.0A
Other languages
Chinese (zh)
Other versions
CN114997138B (en
Inventor
卞晓瑜
肖鸣林
张标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yida Technology Shanghai Co ltd
Original Assignee
Yida Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yida Technology Shanghai Co ltd
Priority to CN202210699721.0A
Publication of CN114997138A
Application granted
Publication of CN114997138B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, a device, equipment and a readable storage medium for parsing a chemical specification. The method comprises: parsing the chemical specification to obtain the line text blocks corresponding to each page of text, and sorting them; removing the header and footer line text blocks of each page according to a set character edit distance to obtain body line text blocks; acquiring, from the body line text blocks, target texts that meet set conditions, the conditions being set according to the positions where chapter titles may appear; clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head of the chemical specification and the chapter titles; and combining each chapter title and its corresponding chapter body into a chapter text. Because the writing standard for chemical specifications fixes the number of chapter titles and gives each chapter title a fixed name, the clustering target for the target texts is well defined, so each chapter title can be obtained clearly and accurately, which facilitates subsequent analysis.

Description

Chemical specification analysis method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of chemical specification parsing technology, and more particularly, to a chemical specification parsing method, apparatus, device, and readable storage medium.
Background
The existing method for analyzing the chemical specification generally identifies the text or picture blocks in the chemical specification according to a general pdf analysis method, and then analyzes the document according to keywords and corresponding texts of the keywords to extract the required text.
However, the existing method for analyzing the chemical specification does not fully utilize the writing specification, the text structure characteristics and the industrial characteristics of the chemical industry of the chemical specification, so that various problems exist in analyzing the chemical specification, such as: the extraction speed is slow, or the required text can not be extracted accurately, or a plurality of texts are extracted under the same extraction condition, etc.
Therefore, an analysis scheme aiming at the chemical specification and capable of improving the extraction accuracy of the chemical specification is very worthy of research.
Disclosure of Invention
In view of the above, the present application provides a chemical specification parsing method, apparatus, device and readable storage medium, which offer a parsing scheme that is tailored to chemical specifications and improves the accuracy of extraction from them.
In order to achieve the above object, the following solutions are proposed:
a method of resolving a chemical specification, comprising:
analyzing texts of a chemical specification to obtain line text blocks corresponding to each page of texts of the chemical specification, wherein each line text block comprises the corresponding line text in the chemical specification, and the chemical specification comprises a plurality of pages of texts;
sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determining header line text blocks and footer line text blocks in all line text blocks corresponding to each page of text after sequencing according to the set character editing distance, and removing the header line text blocks and the footer line text blocks to obtain body line text blocks corresponding to each page of text;
acquiring target texts from the body line text blocks, wherein a text that is at the leftmost side of the body line text block in which it is located and contains a colon, and a text that is in the middle of the body line text block in which it is located, are taken as the target texts, and the word count of each target text does not exceed a set word count threshold;
clustering each target text according to preset fonts, word sizes and position coordinates to obtain the head and each chapter title of the chemical specification;
and determining the chapter text corresponding to each chapter title, combining each chapter title and the corresponding chapter text into a chapter text, and outputting the head and each chapter text to a user terminal.
Preferably, the parsing the text of the chemical specification to obtain each line text block corresponding to each page of the text of the chemical specification includes:
dividing a text of the chemical specification into text blocks, wherein the text blocks comprise texts in corresponding areas in the chemical specification;
splitting each text block according to text lines to obtain a plurality of small line text blocks;
and combining the small line text blocks corresponding to the same text line into line text blocks according to the coordinate values of the small line text blocks to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein the sequence of the text in each line text block is consistent with the sequence of the text in the corresponding line in the chemical specification.
Preferably, the determining and removing a header line text block and a footer line text block in each line text block corresponding to each page of text after sorting according to the set character editing distance to obtain a body line text block corresponding to each page of text includes:
aiming at each line text block corresponding to each page of text:
acquiring the character editing distance of the text in each line text block from top to bottom, and determining the line text block with the character editing distance of the text larger than a first set threshold value as a header line text block;
acquiring the character editing distance of the text in each line text block from bottom to top, and determining the line text block with the character editing distance of the text larger than a second set threshold value as a footer line text block;
and removing the header line text block and the footer line text block, and taking the rest other line text blocks as text line text blocks.
Preferably, the obtaining of the target text in the text block of the body line includes:
determining, among the small line text blocks forming each body line text block, the leftmost small line text block and the text it contains, to obtain a plurality of candidate texts;
acquiring, from the candidate texts, a text which contains a colon and whose word count does not exceed a set word count threshold as a first target text;
and determining, among the body line text blocks, a target body line text block whose text has a word count not exceeding the set word count threshold and is in the middle of the body line text block in which it is located, and taking the text contained in the target body line text block as a second target text.
Preferably, the determining the chapter body corresponding to each chapter title includes:
determining a text block of a text line where each chapter title is located;
taking each chapter title except the last chapter title as a current chapter title, and determining the text contained in each body line text block between the body line text block where the current chapter title is located and the body line text block where the next chapter title is located as the chapter body of the current chapter title;
and determining the text contained in each text line text block after the text block where the last chapter title is located as the chapter text of the last chapter title.
Preferably, the combining each chapter title and the corresponding chapter body into chapter text includes:
determining the text which contains a colon and is positioned at the leftmost side in the text of each chapter as a title;
for each chapter body, determining a text after the colon of each title as the body of each title;
regarding each chapter body, taking each title and the corresponding body thereof as a text paragraph, and sequencing each text paragraph according to the appearance sequence of each title in the chapter body to obtain the chapter body after the text paragraphs are sequenced;
and combining each chapter title and the sequenced chapter texts of the corresponding text paragraphs into chapter texts.
A chemical specification resolution device comprising:
the specification analyzing unit is used for analyzing texts of the chemical specification to obtain each line of text block corresponding to each page of text of the chemical specification, each line of text block comprises the text of the corresponding line in the chemical specification, and the chemical specification has a plurality of pages of text;
the line text block sequencing unit is used for sequencing the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
the body line text block determining unit is used for determining, according to the set character editing distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text;
a target text acquisition unit for acquiring target texts from the body line text blocks, wherein a text that is at the leftmost side of the body line text block in which it is located and contains a colon, and a text that is in the middle of the body line text block in which it is located, are taken as the target texts, and the word count of each target text does not exceed a set word count threshold;
the chapter title acquisition unit is used for clustering each target text according to preset fonts, word sizes and position coordinates to obtain the head of the chemical specification and each chapter title;
and the chapter text determining unit is used for determining the chapter text corresponding to each chapter title, combining each chapter title and the corresponding chapter text into the chapter text, and outputting the head and each chapter text to the user terminal.
Preferably, the body line text block determination unit includes:
aiming at each line text block corresponding to each page of text:
the header line text block determining unit is used for acquiring the character editing distance of the text in each line text block from top to bottom, and determining the line text block of which the character editing distance is larger than a first set threshold value as a header line text block;
the footer line text block determining unit is used for acquiring the character editing distance of the text in each line text block from bottom to top, and determining the line text block of which the character editing distance is larger than a second set threshold value as a footer line text block;
and the body line text block selecting unit is used for removing the header line text block and the footer line text block and taking the remaining line text blocks as body line text blocks.
A chemical specification resolution device comprising a memory and a processor;
the memory is used for storing programs;
the processor is used for executing the program and realizing the steps of the chemical specification analysis method.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above chemical specification resolution method.
According to the above scheme, the chemical specification parsing method provided by the present application parses out the line text blocks corresponding to each page of text of the chemical specification and sorts them. It then removes the header and footer line text blocks of each page according to the set character edit distance to obtain the body line text blocks, which eliminates interference from headers and footers during parsing. Next, it acquires target texts meeting set conditions from the body line text blocks, the conditions being set according to the positions where chapter titles may appear, and clusters the target texts according to preset fonts, font sizes and position coordinates to obtain the head of the chemical specification and a specific number of chapter titles with specific names. Finally, it combines each chapter title and its corresponding chapter body into a chapter text, and outputs the head and the chapter texts of the chemical specification to the user terminal.
Because the writing specification of the chemical specification stipulates that the chapter titles of the specification are fixed in quantity and each chapter title has a corresponding fixed name, the clustering target for clustering the target text can be determined on the basis of the fixed name, and each chapter title can be clearly and accurately obtained.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flow chart of a chemical specification resolution method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a chemical specification analyzer according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of the hardware structure of a chemical specification parsing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic flow chart of a chemical specification parsing method provided in an embodiment of the present application, where the method includes:
Step S100: parsing the text of the chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification.
Specifically, the chemical specification may have multiple pages of text, and each page may consist of multiple lines of text. Each page is split into individual line text blocks, and each line text block contains the text of its corresponding line.
In addition, a line text block may contain information such as the font, font size, font color, line text block number, and coordinate values of set points on the line text block, in addition to the text of its corresponding line. The number of each line text block can be different and unique, and the coordinates of the set points on the line text blocks can include coordinate values of the upper left corner and the lower right corner of the line text block and coordinate values of other set points.
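The attributes listed above can be pictured as a small record type. The following is only an illustrative sketch: the class name `LineTextBlock`, its field names, and the sample values are assumptions made for this example and do not come from the application.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LineTextBlock:
    """Hypothetical container for one line text block and its attributes."""
    number: int                        # unique number of the block
    text: str                          # text of the corresponding line
    font: str
    font_size: float
    font_color: str
    top_left: Tuple[float, float]      # (x, y) of the upper left corner
    bottom_right: Tuple[float, float]  # (x, y) of the lower right corner

    @property
    def center_x(self) -> float:
        # Horizontal midpoint, useful later when testing whether a line is centered.
        return (self.top_left[0] + self.bottom_right[0]) / 2.0


# Sample values are invented for illustration only.
example = LineTextBlock(1, "Section 1: Identification", "SimHei", 14.0,
                        "#000000", (72.0, 90.0), (300.0, 106.0))
```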
Step S110: sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks.
Specifically, the line text blocks corresponding to each page of text may be sorted from top to bottom according to the coordinate values of set points at the same position on each line text block. For example, they may be sorted by the coordinate value of the upper left corner of each line text block, so that the order of the texts contained in the sorted line text blocks is the same as the order of the texts in the chemical specification before parsing.
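A minimal sketch of this sorting step follows. It assumes each block is a `(text, x0, y0)` tuple holding the upper left corner, and that y grows downward from the top of the page; both the tuple layout and the coordinate convention are assumptions, since the application does not fix them.

```python
from typing import List, Tuple

# Hypothetical block layout: (text, x0, y0), where (x0, y0) is the upper left corner.
Block = Tuple[str, float, float]


def sort_top_to_bottom(blocks: List[Block]) -> List[Block]:
    """Sort line text blocks by the y value of the upper left corner, then by x.

    Assumes the page origin is at the top left, so a smaller y means higher on the page.
    """
    return sorted(blocks, key=lambda b: (b[2], b[1]))


page = [("body line", 72.0, 200.0), ("header", 72.0, 30.0), ("title", 72.0, 90.0)]
ordered = sort_top_to_bottom(page)
```

If the parsing library reports coordinates with the origin at the bottom left, the sort key would need to be negated.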
Step S120: determining, according to the set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text.
Specifically, the header and footer line text blocks of each page may be determined among the sorted line text blocks of that page according to the set character edit distance and removed. The remaining line text blocks correspond to the body of the page and are referred to as body line text blocks.
Step S130: acquiring target texts from the body line text blocks.
Specifically, a text suspected of being a chapter title may be used as a target text. A chapter title generally does not contain many words, and it generally appears at a specific position on the page.
Therefore, a text that is at the leftmost side of its line text block and contains a colon may be used as a target text, provided its word count does not exceed the set word count threshold. In addition, a text in the middle of its line text block may be used as a target text, provided the word count of the text contained in that line text block does not exceed the set word count threshold.
Step S140: clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification.
Specifically, the font, font size and position of a chapter title generally differ from those of the body text. The attribute values of the chapter titles in the chemical specification, such as font, font size and position coordinates, may be determined in advance. The target texts may then be clustered on the basis of these attribute values, so that the head and the chapter titles of the chemical specification are obtained from the plurality of target texts.
The chemical specification typically contains 16 chapters, whose titles in order are: enterprise identification, hazard identification, composition, first aid measures, fire protection measures, accidental spillage measures, operation and storage, exposure control and personal protection, physical and chemical characteristics, stability, toxicology information, ecological information, disposal precautions, transportation information, regulatory information, and other information. The chemical specification also typically has a head at the beginning.
The number and names of the chapter titles are fixed, so the clustering target is clear when the target texts are clustered: obtaining the 16 chapter titles and the head from the plurality of target texts.
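The application does not name a particular clustering algorithm. As a stand-in, the sketch below groups target texts by matching them against preset attribute profiles (font, font size, x position) within tolerances. The names `Target`, `Profile`, `assign`, `cluster`, the tolerance values, and the sample fonts and coordinates are all assumptions for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Target(NamedTuple):
    text: str
    font: str
    size: float
    x: float       # x coordinate of the upper left corner


class Profile(NamedTuple):
    font: str
    size: float
    x: float


def assign(t: Target, profiles: Dict[str, Profile],
           size_tol: float = 0.5, x_tol: float = 5.0) -> Optional[str]:
    """Return the label of the first preset profile the target matches, else None."""
    for label, p in profiles.items():
        if (t.font == p.font
                and abs(t.size - p.size) <= size_tol
                and abs(t.x - p.x) <= x_tol):
            return label
    return None


def cluster(targets: List[Target], profiles: Dict[str, Profile]) -> Dict[str, List[str]]:
    """Group target texts by matched profile; unmatched targets are dropped."""
    groups: Dict[str, List[str]] = {label: [] for label in profiles}
    for t in targets:
        label = assign(t, profiles)
        if label is not None:
            groups[label].append(t.text)
    return groups


# Preset attribute values and sample targets are invented for this example.
profiles = {"head": Profile("SimHei", 18.0, 200.0),
            "chapter_title": Profile("SimHei", 14.0, 72.0)}
targets = [Target("Safety Data Sheet", "SimHei", 18.0, 201.0),
           Target("Section 1: Identification", "SimHei", 14.0, 72.0),
           Target("Product name:", "SimSun", 10.5, 72.0)]
groups = cluster(targets, profiles)
```

Because the number of chapter titles is fixed at 16, a practical check after grouping would be to confirm that the `chapter_title` group has the expected size.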
Step S150: determining the chapter body corresponding to each chapter title, combining each chapter title and its corresponding chapter body into a chapter text, and outputting the head and each chapter text to a user terminal.
Specifically, each chapter title has a corresponding chapter body. The chapter body corresponding to each chapter title may be determined among the body line text blocks, each chapter title and its chapter body may be combined into a chapter text, and the head and the clearly delimited chapter texts may then be output to the user terminal.
Parsing a chemical specification with existing document parsing methods has certain defects: the structure assumed by the prior art is flat and does not take into account the multi-chapter, multi-level structure of a chemical specification. Such methods only match a characteristic text against its corresponding value and then extract the field, so identical fields that appear in several chapters easily interfere with one another.
Some embodiments of the present application describe in detail the process in step S100 of parsing the text of the chemical specification to obtain the line text blocks corresponding to each page of text.
Specifically, the process may comprise the following steps:
S1: dividing the text of the chemical specification into text blocks, wherein each text block contains the text of the corresponding area in the chemical specification.
Specifically, the text block may correspond to a larger area in the chemical specification, and the text block may include text of the corresponding area, and may further include other related information, which may specifically refer to the description of the line text block in the above embodiment.
It should be noted that some chemical specifications contain pictures. Since a chemical specification is generally in PDF format, the coordinate values of each picture may be obtained first; the chemical specification is then converted from PDF to a picture format, and each picture is cropped from the converted chemical specification according to its coordinate values. Each picture may carry information such as the coordinate values of its set points and a picture number.
The captured pictures can be placed at the corresponding positions of the corresponding text pages according to the coordinate values of the set points on the pictures, and can be output to the user terminal together with the chapter texts and the head.
S2: splitting each text block according to text lines to obtain a plurality of small line text blocks.
Specifically, each text block may contain multiple lines of text, so each text block may be split by text line into multiple small line text blocks. A small line text block does not necessarily contain a complete line of text.
S3: combining the small line text blocks corresponding to the same text line into line text blocks according to the coordinate values of the small line text blocks.
Specifically, each small line text block has corresponding coordinate values, so the coordinate value of the same set point on each small line text block can be determined, and the small line text blocks with the same coordinate value can be combined into the line text block of one text line. The order of the text in each combined line text block is consistent with the order of the text of the corresponding line in the chemical specification.
Combining all the small line text blocks in this way yields the line text blocks corresponding to each page of text of the chemical specification.
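The merging in S3 can be sketched as grouping small line text blocks by their y coordinate and ordering each group by x. The `(text, x0, y0)` layout, the rounding used to decide that two y values are "the same", and joining fragments with a space are assumptions of this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical layout: (text, x0, y0) with (x0, y0) the upper left corner.
SmallBlock = Tuple[str, float, float]


def merge_into_lines(small_blocks: List[SmallBlock], ndigits: int = 1) -> List[str]:
    """Merge small line text blocks that share a y coordinate into one line each.

    Lines are returned top to bottom; fragments within a line run left to right.
    """
    rows: Dict[float, List[Tuple[float, str]]] = defaultdict(list)
    for text, x0, y0 in small_blocks:
        rows[round(y0, ndigits)].append((x0, text))
    return [" ".join(t for _, t in sorted(rows[y])) for y in sorted(rows)]


fragments = [("Acetone", 160.0, 100.0), ("Product name:", 72.0, 100.0),
             ("CAS No.:", 72.0, 120.0), ("67-64-1", 160.0, 120.0)]
merged = merge_into_lines(fragments)
```

For Chinese text the fragments would more likely be joined without a separator.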
Based on the combined line text blocks, the process in step S130 of acquiring the target texts from the body line text blocks is further described below.
Specifically, the target texts may include first target texts and second target texts. The first target texts are acquired through S1 and S2 below, and the second target texts through S3 below.
S1: determining the leftmost small line text block of each body line text block and the text it contains, to obtain a plurality of candidate texts.
Specifically, each line text block may be composed of a plurality of small line text blocks, so the leftmost small line text block in each line text block and the text it contains can be determined, and that text is used as a candidate text. Because there are multiple line text blocks, multiple candidate texts are obtained.
S2: acquiring, from the candidate texts, a text which contains a colon and whose word count does not exceed the set word count threshold as a first target text.
Specifically, the candidate texts differ in word count and length. A text that contains a colon and whose word count does not exceed the set word count threshold is selected from the candidate texts and determined as a first target text.
S3: determining, among the body line text blocks, a target body line text block whose text has a word count not exceeding the set word count threshold and is in the middle of the line text block, and taking the text contained in the target body line text block as a second target text.
Specifically, some line text blocks contain only a few words. The line text blocks whose word count does not exceed the set word count threshold are selected first; among them, those whose text is in the middle of the line text block are determined as target body line text blocks, and the text contained in each target body line text block is used as a second target text.
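The two selection rules can be sketched as follows. Several points are assumptions made for this example rather than details from the application: the `(text, x_left, x_right)` layout, measuring the word count as a character count, reading "in the middle" as horizontally centered on the page within a tolerance, and the function names.

```python
from typing import List, Tuple

# Hypothetical layout of a small line text block: (text, x_left, x_right).
SmallBlock = Tuple[str, float, float]
Line = List[SmallBlock]


def first_targets(lines: List[Line], max_len: int) -> List[str]:
    """Leftmost fragment of each line, kept if it contains a colon and is short."""
    out = []
    for line in lines:
        if not line:
            continue
        text = min(line, key=lambda s: s[1])[0].strip()
        if (":" in text or "\uff1a" in text) and len(text) <= max_len:
            out.append(text)
    return out


def second_targets(lines: List[Line], page_width: float,
                   max_len: int, tolerance: float) -> List[str]:
    """Whole-line text, kept if the line is short and horizontally centered."""
    out = []
    for line in lines:
        if not line:
            continue
        text = "".join(s[0] for s in sorted(line, key=lambda s: s[1])).strip()
        left = min(s[1] for s in line)
        right = max(s[2] for s in line)
        centered = abs((left + right) / 2.0 - page_width / 2.0) <= tolerance
        if centered and len(text) <= max_len:
            out.append(text)
    return out


lines = [
    [("Product name:", 72.0, 150.0), ("Acetone", 160.0, 220.0)],
    [("Section 1 Identification", 200.0, 395.0)],
    [("This long body sentence describes safe handling in detail.", 72.0, 520.0)],
]
firsts = first_targets(lines, max_len=30)
seconds = second_targets(lines, page_width=595.0, max_len=30, tolerance=10.0)
```

The third sample line is centered but too long, so the length threshold is what excludes it.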
Some embodiments of the present application describe in detail the process in step S120 of determining, according to the set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks of each page and removing them to obtain the body line text blocks of each page.
For the line text blocks corresponding to each page of text, the process may include the following steps:
S1: acquiring the character edit distance of the text in each line text block from top to bottom, and determining the line text blocks whose character edit distance is greater than a first set threshold as header line text blocks.
Headers on different text pages are generally identical except for the page number they show. Consecutive digits in each line text block may therefore first be converted into one and the same token, that is, the character sequence is converted into a token sequence; for example, both "page 3" and "page 34" can be converted to "page #NUM#". This keeps the character edit distance between headers (the minimum number of edit operations required to change one string into the other) small, whereas the character edit distance between body lines is generally large.
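The digit normalization and the edit distance can be sketched as below. The edit distance is the standard Levenshtein distance; the function names and the exact form of the `#NUM#` token are assumptions, and how the distance is compared with the set thresholds is left out because it depends on the embodiment.

```python
import re


def normalize_digits(text: str) -> str:
    """Replace every run of consecutive digits with a single placeholder token."""
    return re.sub(r"\d+", "#NUM#", text)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


raw = edit_distance("page 3", "page 34")
normalized = edit_distance(normalize_digits("page 3"), normalize_digits("page 34"))
```

After normalization the two header strings become identical, so page numbers no longer contribute to the distance.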
Specifically, the order of the texts in the sorted line text blocks of each page is the same as the text order of the page in the chemical specification, and a header is generally located at the top of a page. The character edit distance of the text in each line text block can therefore be obtained from top to bottom, and the line text blocks whose character edit distance is greater than the first set threshold are determined as header line text blocks.
S2: acquiring the character edit distance of the text in each line text block from bottom to top, and determining the line text blocks whose character edit distance is greater than a second set threshold as footer line text blocks.
Specifically, reference may be made to the above process of determining header line text blocks.
S3: removing the header line text blocks and the footer line text blocks, and taking the remaining line text blocks as body line text blocks.
Specifically, after the line text blocks at the top and bottom of each page that correspond to the header and footer are removed, the remaining line text blocks can be used as body line text blocks.
In some embodiments of the present application, step S150 above determines the chapter body corresponding to each chapter title and combines each chapter title with its chapter body into a chapter text. The processes of determining the chapter body and of combining the chapter text are described in detail below.
Specifically, the process of determining the chapter body may include:
S1, determining the body line text block in which each chapter title is located.
S2, taking each chapter title except the last one as the current chapter title, and determining the text contained in the body line text blocks between the body line text block containing the current chapter title and the body line text block containing the next chapter title as the chapter body of the current chapter title.
S3, determining the text contained in the body line text blocks after the body line text block containing the last chapter title as the chapter body of the last chapter title.
Specifically, because the body line text blocks of each page have already been sorted and follow the writing order of the chemical specification, the text between every two consecutive chapter titles can be used as the chapter body of the earlier title, and the chapter body of the last chapter title is the text contained in the body line text blocks that follow the body line text block in which that title is located.
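The three steps above amount to slicing the ordered lines at the positions of the chapter titles. A minimal sketch, assuming the lines of all pages are already concatenated in reading order and the indices of the title lines are known (the function name and argument layout are illustrative):

```python
from typing import List, Tuple


def chapter_bodies(lines: List[str],
                   title_indices: List[int]) -> List[Tuple[str, List[str]]]:
    """Pair each chapter title with the body lines that follow it."""
    ordered = sorted(title_indices)
    chapters: List[Tuple[str, List[str]]] = []
    for position, start in enumerate(ordered):
        # A body ends where the next title begins; the last one runs to the end.
        end = ordered[position + 1] if position + 1 < len(ordered) else len(lines)
        chapters.append((lines[start], lines[start + 1:end]))
    return chapters


lines = ["Section 1", "Product name: Acetone", "Section 2", "Hazards: Flammable"]
print(chapter_bodies(lines, [0, 2]))
# [('Section 1', ['Product name: Acetone']), ('Section 2', ['Hazards: Flammable'])]
```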
Specifically, the process of combining chapter text may include:
S1, determining, in each chapter body, the text that contains a colon and is located at the leftmost side as a title.
S2, for each chapter body, determining the text after the colon of each title as the body of that title.
S3, for each chapter body, taking each title and its corresponding body as a text paragraph, and ordering the text paragraphs according to the order in which the titles appear in the chapter body, to obtain the chapter body with ordered text paragraphs.
Specifically, the body corresponding to a title may itself contain text with a colon, and such text may also be treated as a title.
Therefore, the titles and their bodies can be ordered according to the order in which the titles appear in the chapter body, and the text paragraphs can be divided into levels accordingly. Each chapter body is thus divided into a clear hierarchy: the body of a title may contain next-level titles, the body of a next-level title may in turn contain further next-level titles, and so on, until the titles of all levels in the chapter body have been divided.
S4, combining each chapter title and its chapter body with ordered text paragraphs into a chapter text.
With this scheme, the chapter bodies are divided structurally, so that the content of each chapter is made clear and the accuracy of chemical specification analysis can be improved.
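The colon-based division of a chapter body into title and body paragraphs could look roughly like the sketch below. It is a flat, single-level illustration only: nested next-level titles are not handled, continuation lines without a colon are assumed to belong to the preceding title, and both the ASCII colon and the full-width colon are accepted. All names are hypothetical.

```python
import re
from typing import List, Tuple

COLON = re.compile(r"[:：]")  # ASCII and full-width colon


def split_paragraphs(body_lines: List[str]) -> List[Tuple[str, str]]:
    """Turn chapter body lines into ordered (title, body) paragraphs."""
    paragraphs: List[Tuple[str, str]] = []
    for line in body_lines:
        parts = COLON.split(line, maxsplit=1)
        if len(parts) == 2 and parts[0].strip():
            # Text left of the first colon is the title, the rest its body.
            paragraphs.append((parts[0].strip(), parts[1].strip()))
        elif paragraphs:
            # No title on this line: append it to the previous paragraph.
            title, text = paragraphs[-1]
            paragraphs[-1] = (title, (text + " " + line.strip()).strip())
        else:
            paragraphs.append(("", line.strip()))
    return paragraphs


body = ["Product name: Acetone", "Hazards: Highly flammable liquid", "and vapour."]
print(split_paragraphs(body))
# [('Product name', 'Acetone'), ('Hazards', 'Highly flammable liquid and vapour.')]
```

Because the lines are processed in order, the resulting paragraphs already follow the order in which the titles appear in the chapter body.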
The chemical specification analysis device provided in the embodiments of the present application is described below. The device described below and the chemical specification analysis method described above may be referred to in correspondence with each other.
First, the chemical specification analysis device is described with reference to Fig. 2. As shown in Fig. 2, the device may include:
the specification analyzing unit 100 is configured to analyze a text of a chemical specification to obtain line text blocks corresponding to each page of text of the chemical specification, where each line text block includes a text of a corresponding line in the chemical specification, and the chemical specification includes multiple pages of texts;
a line text block sorting unit 110, configured to sort, according to the coordinate value of each line text block, each line text block corresponding to each page of text from top to bottom;
a body line text block determining unit 120, configured to determine and remove a header line text block and a footer line text block in each line text block corresponding to each sequenced page of text according to the set character editing distance, so as to obtain a body line text block corresponding to each page of text;
a target text acquisition unit 130, configured to acquire target texts from the body line text blocks, taking as target texts the text that is at the leftmost side of its body line text block and contains a colon, and the text that is in the middle of its body line text block, where the number of words of each target text does not exceed a set word-number threshold;
a chapter title obtaining unit 140, configured to cluster the target texts according to preset fonts, font sizes, and position coordinates to obtain the head of the chemical specification and the chapter titles;
a chapter text determining unit 150, configured to determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into a chapter text, and output the head and each chapter text to the user terminal.
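The clustering performed by the chapter title obtaining unit 140 is not specified further in the text. One simple way to approximate it is to group target texts that share a font, a font size and roughly the same left coordinate, as sketched below. The record layout, the fixed-width bucketing of the x coordinate and the tolerance value are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TargetText:
    text: str
    font: str
    size: float
    x: float  # left coordinate of the text on the page


def cluster_targets(targets: List[TargetText], x_tolerance: float = 5.0
                    ) -> Dict[Tuple[str, float, int], List[TargetText]]:
    """Group target texts by (font, font size, x-coordinate bucket)."""
    clusters: Dict[Tuple[str, float, int], List[TargetText]] = defaultdict(list)
    for target in targets:
        # Texts whose left edge falls in the same bucket are treated as aligned.
        key = (target.font, round(target.size, 1), int(target.x // x_tolerance))
        clusters[key].append(target)
    return dict(clusters)
```

Fixed buckets can separate two values that sit on either side of a bucket boundary, so a production system would more likely use a tolerance-based or density-based clustering; the sketch only shows which features drive the grouping.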
Optionally, the body line text block determining unit may include:
for the line text blocks corresponding to each page of text:
a header line text block determining unit, configured to acquire the character edit distance of the text in each line text block from top to bottom, and determine the line text blocks whose character edit distance is greater than a first set threshold as header line text blocks;
a footer line text block determining unit, configured to acquire the character edit distance of the text in each line text block from bottom to top, and determine the line text blocks whose character edit distance is greater than a second set threshold as footer line text blocks;
and a body line text block selecting unit, configured to remove the header line text blocks and the footer line text blocks and take the remaining line text blocks as body line text blocks.
Optionally, the specification analyzing unit may include:
a text block acquisition unit, configured to divide the text of the chemical specification into text blocks, where each text block includes the text in a corresponding area of the chemical specification;
a text block splitting unit, configured to split each text block by text line to obtain a plurality of small line text blocks;
and a small line text block combination unit, configured to combine the small line text blocks belonging to the same text line into a line text block according to the coordinate values of the small line text blocks, to obtain the line text blocks corresponding to each page of text of the chemical specification, where the order of the text in each line text block is consistent with the order of the text in the corresponding line of the chemical specification.
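Combining small line text blocks into line text blocks by coordinates can be illustrated as follows. This is a sketch under stated assumptions: each fragment carries the x and y coordinates of its top-left corner, y increases downward, and two fragments belong to the same line when their y values differ by at most a hypothetical tolerance.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    text: str
    x: float  # left edge
    y: float  # top edge, assumed to increase downward


def merge_fragments(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines (top to bottom) and join each line left to right."""
    lines: List[List[Fragment]] = []
    for fragment in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Start a new line unless this fragment sits on the current line's baseline.
        if lines and abs(lines[-1][0].y - fragment.y) <= y_tolerance:
            lines[-1].append(fragment)
        else:
            lines.append([fragment])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


fragments = [
    Fragment("Acetone", 200.0, 100.5),
    Fragment("Product name:", 72.0, 100.0),
    Fragment("67-64-1", 200.0, 120.0),
    Fragment("CAS No.:", 72.0, 120.0),
]
print(merge_fragments(fragments))  # ['Product name: Acetone', 'CAS No.: 67-64-1']
```

Sorting by y first also yields the top-to-bottom order that the line text block sorting unit requires.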
Optionally, the target text acquiring unit may include:
a candidate text determining unit, configured to determine, among the small line text blocks forming each body line text block, the leftmost small line text block and the text it contains, to obtain a plurality of candidate texts;
a first target text acquisition unit, configured to acquire, from the plurality of candidate texts, the text that contains a colon and whose number of words does not exceed a set word-number threshold as a first target text;
and a second target text acquisition unit, configured to determine, among the body line text blocks, the target body line text blocks whose text does not exceed the set word-number threshold in number of words and is located in the middle of the line, and to take the text contained in the target body line text blocks as a second target text.
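The two selection rules above can be sketched as follows. Several details are assumptions rather than statements of the application: the number of words is measured as the count of non-space characters, "in the middle" is interpreted as being horizontally centered on the page within a tolerance, and the `Line` record, the page width and the default thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    fragments: List[str]  # texts of the small line text blocks, left to right
    left: float           # x coordinate where the line's text starts
    right: float          # x coordinate where the line's text ends


def length(text: str) -> int:
    """Number of non-space characters, used here as the word count."""
    return len("".join(text.split()))


def first_targets(lines: List[Line], max_chars: int = 20) -> List[str]:
    """Leftmost fragments that contain a colon and are short enough."""
    result = []
    for line in lines:
        if not line.fragments:
            continue
        leftmost = line.fragments[0]
        if (":" in leftmost or "：" in leftmost) and length(leftmost) <= max_chars:
            result.append(leftmost)
    return result


def second_targets(lines: List[Line], page_width: float,
                   max_chars: int = 20, tolerance: float = 10.0) -> List[str]:
    """Short lines whose text is horizontally centered on the page."""
    result = []
    for line in lines:
        text = " ".join(line.fragments)
        centre = (line.left + line.right) / 2
        if text and length(text) <= max_chars and abs(centre - page_width / 2) <= tolerance:
            result.append(text)
    return result
```

For example, a line made of the fragments "Product name:" and "Acetone" would contribute "Product name:" as a first target text, while a short centered line such as "SECTION 1" would be returned as a second target text.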
Optionally, the chapter text determining unit may include:
a first chapter text determining subunit, configured to determine the body line text block in which each chapter title is located;
a second chapter text determining subunit, configured to take each chapter title except the last one as the current chapter title, and determine the text contained in the body line text blocks between the body line text block containing the current chapter title and the body line text block containing the next chapter title as the chapter body of the current chapter title;
and a third chapter text determining subunit, configured to determine the text contained in the body line text blocks after the body line text block containing the last chapter title as the chapter body of the last chapter title.
Optionally, the chapter text determining unit may further include:
a fourth chapter text determining subunit, configured to determine, in each chapter body, the text that contains a colon and is located at the leftmost side as a title;
a fifth chapter text determining subunit, configured to determine, for each chapter body, the text after the colon of each title as the body of that title;
a sixth chapter text determining subunit, configured to, for each chapter body, take each title and its corresponding body as a text paragraph, and order the text paragraphs according to the order in which the titles appear in the chapter body, to obtain the chapter body with ordered text paragraphs;
and a seventh chapter text determining subunit, configured to combine each chapter title and its chapter body with ordered text paragraphs into a chapter text.
The chemical specification analysis device provided by the embodiments of the present application can be applied to chemical specification analysis equipment. Fig. 3 is a block diagram showing the hardware configuration of the chemical specification analysis equipment. Referring to Fig. 3, the hardware configuration may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
analyzing the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block comprises the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determining, according to a set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text;
acquiring target texts from the body line text blocks, wherein the text that is at the leftmost side of its body line text block and contains a colon, and the text that is in the middle of its body line text block, are taken as target texts, and the number of words of each target text does not exceed a set word-number threshold;
clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification;
and determining the chapter body corresponding to each chapter title, combining each chapter title and its corresponding chapter body into a chapter text, and outputting the head and each chapter text to a user terminal.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
analyzing the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block comprises the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determining, according to a set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text;
acquiring target texts from the body line text blocks, wherein the text that is at the leftmost side of its body line text block and contains a colon, and the text that is in the middle of its body line text block, are taken as target texts, and the number of words of each target text does not exceed a set word-number threshold;
clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification;
and determining the chapter body corresponding to each chapter title, combining each chapter title and its corresponding chapter body into a chapter text, and outputting the head and each chapter text to a user terminal.
Optionally, for the refined and extended functions of the program, reference may be made to the description above.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A chemical specification analysis method, comprising:
analyzing the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block comprises the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
sorting the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
determining, according to a set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text;
acquiring target texts from the body line text blocks, wherein the text that is at the leftmost side of its body line text block and contains a colon, and the text that is in the middle of its body line text block, are taken as target texts, and the number of words of each target text does not exceed a set word-number threshold;
clustering the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification;
and determining the chapter body corresponding to each chapter title, combining each chapter title and its corresponding chapter body into a chapter text, and outputting the head and each chapter text to a user terminal.
2. The method of claim 1, wherein analyzing the text of the chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification comprises:
dividing the text of the chemical specification into text blocks, wherein each text block comprises the text in a corresponding area of the chemical specification;
splitting each text block by text line to obtain a plurality of small line text blocks;
and combining the small line text blocks belonging to the same text line into a line text block according to the coordinate values of the small line text blocks, to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein the order of the text in each line text block is consistent with the order of the text in the corresponding line of the chemical specification.
3. The method of claim 1, wherein determining, according to the set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and removing them to obtain the body line text blocks corresponding to each page of text, comprises:
for the line text blocks corresponding to each page of text:
acquiring the character edit distance of the text in each line text block from top to bottom, and determining the line text blocks whose character edit distance is greater than a first set threshold as header line text blocks;
acquiring the character edit distance of the text in each line text block from bottom to top, and determining the line text blocks whose character edit distance is greater than a second set threshold as footer line text blocks;
and removing the header line text blocks and the footer line text blocks, and taking the remaining line text blocks as body line text blocks.
4. The method of claim 2, wherein acquiring the target texts from the body line text blocks comprises:
determining, among the small line text blocks forming each body line text block, the leftmost small line text block and the text it contains, to obtain a plurality of candidate texts;
acquiring, from the candidate texts, the text that contains a colon and whose number of words does not exceed the set word-number threshold as a first target text;
and determining, among the body line text blocks, the target body line text blocks whose text does not exceed the set word-number threshold in number of words and is located in the middle of the line, and taking the text contained in the target body line text blocks as a second target text.
5. The method of claim 1, wherein determining the chapter body corresponding to each chapter title comprises:
determining the body line text block in which each chapter title is located;
taking each chapter title except the last one as the current chapter title, and determining the text contained in the body line text blocks between the body line text block containing the current chapter title and the body line text block containing the next chapter title as the chapter body of the current chapter title;
and determining the text contained in the body line text blocks after the body line text block containing the last chapter title as the chapter body of the last chapter title.
6. The method of claim 1, wherein combining each chapter title and its corresponding chapter body into a chapter text comprises:
determining, in each chapter body, the text that contains a colon and is located at the leftmost side as a title;
for each chapter body, determining the text after the colon of each title as the body of that title;
for each chapter body, taking each title and its corresponding body as a text paragraph, and ordering the text paragraphs according to the order in which the titles appear in the chapter body, to obtain the chapter body with ordered text paragraphs;
and combining each chapter title and its chapter body with ordered text paragraphs into a chapter text.
7. A chemical specification analysis device, comprising:
a specification analyzing unit, configured to analyze the text of a chemical specification to obtain the line text blocks corresponding to each page of text of the chemical specification, wherein each line text block comprises the text of the corresponding line in the chemical specification, and the chemical specification has multiple pages of text;
a line text block sorting unit, configured to sort the line text blocks corresponding to each page of text from top to bottom according to the coordinate values of the line text blocks;
a body line text block determining unit, configured to determine, according to a set character edit distance, the header line text blocks and footer line text blocks among the sorted line text blocks corresponding to each page of text, and remove them to obtain the body line text blocks corresponding to each page of text;
a target text acquisition unit, configured to acquire target texts from the body line text blocks, taking as target texts the text that is at the leftmost side of its body line text block and contains a colon, and the text that is in the middle of its body line text block, wherein the number of words of each target text does not exceed a set word-number threshold;
a chapter title obtaining unit, configured to cluster the target texts according to preset fonts, font sizes and position coordinates to obtain the head and the chapter titles of the chemical specification;
and a chapter text determining unit, configured to determine the chapter body corresponding to each chapter title, combine each chapter title and its corresponding chapter body into a chapter text, and output the head and each chapter text to a user terminal.
8. The apparatus of claim 1, wherein the body line text block determining unit comprises:
for the line text blocks corresponding to each page of text:
a header line text block determining unit, configured to acquire the character edit distance of the text in each line text block from top to bottom, and determine the line text blocks whose character edit distance is greater than a first set threshold as header line text blocks;
a footer line text block determining unit, configured to acquire the character edit distance of the text in each line text block from bottom to top, and determine the line text blocks whose character edit distance is greater than a second set threshold as footer line text blocks;
and a body line text block selecting unit, configured to remove the header line text blocks and the footer line text blocks and take the remaining line text blocks as body line text blocks.
9. Chemical specification analysis equipment, comprising a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the chemical specification analysis method according to any one of claims 1-6.
10. A readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the chemical specification analysis method according to any one of claims 1-6.
CN202210699721.0A 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium Active CN114997138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210699721.0A CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210699721.0A CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN114997138A true CN114997138A (en) 2022-09-02
CN114997138B CN114997138B (en) 2024-07-19

Family

ID=83034943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210699721.0A Active CN114997138B (en) 2022-06-20 2022-06-20 Chemical specification analysis method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114997138B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541929A (en) * 2010-12-22 2012-07-04 北大方正集团有限公司 Method and device for extracting format file catalogue
CN105654022A (en) * 2014-11-12 2016-06-08 北大方正集团有限公司 Method and device for extracting structured document information
CN108614898A (en) * 2018-05-10 2018-10-02 爱因互动科技发展(北京)有限公司 Document method and device for analyzing
CN110704570A (en) * 2019-08-13 2020-01-17 北京众信博雅科技有限公司 Continuous page layout document structured information extraction method
CN110717323A (en) * 2019-10-17 2020-01-21 北京幻想纵横网络技术有限公司 Document seal dividing method and device, terminal and computer readable storage medium

Also Published As

Publication number Publication date
CN114997138B (en) 2024-07-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant