CN113535892A - Industry research report searching method and device and electronic equipment - Google Patents

Industry research report searching method and device and electronic equipment

Info

Publication number
CN113535892A
Authority
CN
China
Prior art keywords
industry
research report
search
industry research
feature information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110638917.4A
Other languages
Chinese (zh)
Other versions
CN113535892B (en)
Inventor
李朋超
温馨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yichuangxinke Information Technology Co ltd
Original Assignee
Beijing Yichuangxinke Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yichuangxinke Information Technology Co ltd
Priority to CN202110638917.4A
Publication of CN113535892A
Application granted
Publication of CN113535892B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a searching method, a searching device and electronic equipment for industry research reports. The method comprises the following steps: acquiring an industry research report in pdf format; parsing the text and paragraphs in the industry research report to obtain paragraph key feature information; locating the position coordinates of charts in the industry research report, and acquiring chart key feature information; inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result; establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time; selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result; and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule. The technical scheme of the invention can improve the accuracy of search results.

Description

Industry research report searching method and device and electronic equipment
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a searching method and device for an industry research report and electronic equipment.
Background
With the continuous and rapid development of internet technology, data has penetrated every industry today and is growing at an explosive rate, becoming an important production element. In order to facilitate an industry analyst to retrieve a research report of a target industry from a plurality of industry research reports and obtain required content segments from the research report, the content in the industry research report needs to be finely structured, effective information is identified as much as possible, and the effective information is organized.
However, in the prior art, in the process of retrieving an industry research report in a PDF format, only text information in a PDF file is extracted, and then a retrieval cluster index is constructed.
Disclosure of Invention
In view of this, the invention provides a searching method and device for an industry research report and an electronic device, so as to improve the accuracy of a search result.
In a first aspect, the invention provides a search method for an industry research report, which adopts the following technical scheme:
the searching method of the industry research report comprises the following steps:
acquiring an industry research report in a pdf format;
parsing the text and paragraphs of each pdf page in the industry research report, and acquiring paragraph key feature information;
locating the position coordinates of charts in the industry research report, and acquiring chart key feature information;
inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report;
establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time;
selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result;
and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
Optionally, the parsing the text and paragraphs of each pdf page in the industry research report comprises: parsing the text and paragraphs of each pdf page in the industry research report through pdfminer, and filtering out redundant paragraphs through a preset filtering rule.
Optionally, the paragraph key feature information at least includes the title of the industry research report and the paragraph content; the chart key feature information at least includes the title of the industry research report and the chart title.
Optionally, the outputting at least one industry label scoring result corresponding to the industry research report includes: outputting a plurality of industry label scoring results corresponding to the industry research report, wherein the industry label scoring results corresponding to different industry labels are different.
Optionally, the updating the retrieval cluster index in real time includes: consuming the mysql binlog in real time through canal, and synchronizing the data consumed by canal to the heterogeneous retrieval cluster or a message queue through an adapter, so as to complete the real-time update of the retrieval cluster index.
Optionally, the selecting an industry word as the search input keyword includes: adding a synonym library, and taking industry synonyms in the synonym library as search input keywords.
Optionally, the setting a sorting rule at least according to the industry label scoring result includes: setting different weights for the industry label scoring result, the fragment type and the text respectively according to the emphasis of the search service, and setting the sorting rule according to the weights.
Optionally, the searching the retrieval cluster index by using a distributed service architecture includes: calling at least one search application program interface through the service registration center to complete the search of the retrieval cluster index.
In a second aspect, the present invention provides a searching apparatus for industry research reports, which adopts the following technical solutions:
the searching device of the industry research report comprises:
the acquisition module is used for acquiring an industry research report in a pdf format;
the analysis module is used for parsing the text and paragraphs of each pdf page in the industry research report and acquiring paragraph key feature information; locating the position coordinates of charts in the industry research report, and acquiring chart key feature information; inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report;
the search module is used for establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time; selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result; and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
In a third aspect, the present invention provides an electronic device, which adopts the following technical solutions:
the electronic device comprises a processor, and a memory coupled to the processor, the memory storing a computer program executable by the processor, the processor implementing the method of searching for an industry research report as described in any one of the above when executing the computer program.
The method for searching industry research reports has three advantages. First, selecting industry words as search input keywords can effectively improve the accuracy of search results. Second, structurally extracting data such as paragraphs, charts and text from the pdf-format industry research report enriches the presentation of search results and allows information fragments to be located quickly. Third, combining a distributed parsing framework, real-time retrieval cluster index updating and a distributed service architecture increases the parsing speed and provides a highly available service framework that offers basic service capability for more scenarios.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a method for searching an industry research report provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a process from step S1 to step S4 according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a process from step S5 to step S7 according to an embodiment of the present invention;
FIG. 4 is a schematic illustration of a page in an industry research report provided by an embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus for searching industry research reports provided in accordance with an embodiment of the present invention;
FIG. 6 is an overall architecture diagram of a service platform including a search apparatus according to an embodiment of the present invention.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them. Other embodiments may incorporate structural, logical, electrical, process, and other changes. The examples merely typify possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for those of others. The scope of embodiments of the invention encompasses the full ambit of the claims, as well as all available equivalents of the claims. Embodiments of the invention may be referred to herein, individually or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed.
Specifically, FIG. 1 is a flowchart of a searching method for an industry research report provided by an embodiment of the present invention, FIG. 2 is a schematic process diagram of steps S1 to S4 provided by an embodiment of the present invention, and FIG. 3 is a schematic process diagram of steps S5 to S7 provided by an embodiment of the present invention. It should be noted that, provided substantially the same result is obtained, the searching method of the present invention is not limited to the sequence of the flows shown in FIG. 1 to FIG. 3. The searching method of the industry research report comprises the following steps:
Step S1: acquiring an industry research report in pdf format.
The industry research report is not limited except that its document format is pdf. It may be any report that analyzes and researches content related to a specific industry, and is usually compiled by a securities firm or an economic consulting organization. For example, the industry research report is a financial industry research report, a new energy industry research report, an artificial intelligence industry research report, an intelligent manufacturing industry research report, a new material industry research report, an environmental protection industry research report, a biomedical industry research report, and the like. It should be added that every report that takes an industry as its research subject can be understood as an industry research report in the sense of the embodiments of the present invention; this is determined by the actual content, not by the title or name.
In step S1, the pdf-formatted industry research report may be obtained by purchasing from a data company and periodically updating, or may be obtained by directional crawling through a vertical website. For example, the industry research report in the PDF format is obtained through the task scheduling framework and the message queue MQ, and the obtained industry research report in the PDF format is synchronized into the mysql database and/or the FTP server at regular time. At the same time of acquiring the pdf document of the industry research report in step S1, the title of the industry research report may also be directly acquired.
Step S2: parsing the text and paragraphs of each pdf page in the industry research report, and acquiring paragraph key feature information.
FIG. 4 is a schematic diagram of a page in an industry research report provided by an embodiment of the present invention. Illustratively, the text and paragraphs of each pdf page in the industry research report are parsed through pdfminer.
Furthermore, after the text and paragraphs are parsed, redundant paragraphs can be filtered out through a preset filtering rule, which simplifies the subsequent steps and reduces the data storage requirement. In the embodiment of the invention, the filtering rule can be formulated according to at least one of text length, special characters, position and the like. For example, the filtering rule may define as redundant a paragraph whose text length is smaller than a specific value, a paragraph containing special characters or strings (e.g., legal risk, time, date, etc.), or a paragraph located in the header or footer of a pdf page. Those skilled in the art can choose according to actual needs, and details are not repeated here.
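The filtering rules described above can be sketched as follows. This is a minimal illustration only: the threshold, the patterns, the header/footer margin and the `Paragraph` fields are all assumptions made for the example, and the sketch applies all three rules together although the embodiment only requires at least one of them.

```python
import re
from dataclasses import dataclass

# All values below are illustrative assumptions, not taken from the patent.
MIN_LENGTH = 20                     # paragraphs shorter than this are redundant
SPECIAL_PATTERNS = [
    re.compile(r"legal risk", re.IGNORECASE),
    re.compile(r"\d{4}-\d{2}-\d{2}"),   # a date such as 2021-06-08
]
HEADER_FOOTER_MARGIN = 0.08         # top/bottom fraction of a page


@dataclass
class Paragraph:
    text: str
    y_top: float        # distance of the top edge from the top of the page
    y_bottom: float     # distance of the bottom edge from the top of the page
    page_height: float


def is_redundant(p: Paragraph) -> bool:
    """Return True if the paragraph matches any of the three filtering rules."""
    if len(p.text.strip()) < MIN_LENGTH:
        return True
    if any(pattern.search(p.text) for pattern in SPECIAL_PATTERNS):
        return True
    in_header = p.y_bottom <= p.page_height * HEADER_FOOTER_MARGIN
    in_footer = p.y_top >= p.page_height * (1 - HEADER_FOOTER_MARGIN)
    return in_header or in_footer


def filter_paragraphs(paragraphs):
    """Keep only the paragraphs that are not redundant."""
    return [p for p in paragraphs if not is_redundant(p)]
```

In practice each rule would likely be switched on or off per data source, since a date inside a body paragraph is not always a sign of redundancy.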
Optionally, the paragraph key feature information in the embodiment of the present invention at least includes a title and a paragraph content of an industry research report. The title of the industry research report may be obtained in step S1, or may be obtained after parsing in step S2.
Step S3: locating the position coordinates of charts in the industry research report, and acquiring chart key feature information.
In step S3, the position coordinates of a chart may be located by locating the chart title; the chart is then captured as a screenshot through PIL (Python Image Library), and the chart key feature information is obtained. The chart key feature information at least comprises the title of the industry research report and the chart title. The title of the industry research report may be obtained in step S1, or may be obtained after parsing in step S2.
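A minimal sketch of the coordinate handling behind title-based chart location is given below. It assumes that the chart sits directly below its title and spans the page width inside a fixed margin, which the patent does not state; the function names, the margin and the chart height are hypothetical. PDF coordinates have their origin at the bottom-left of the page, whereas image crop boxes use a top-left origin, so the y axis has to be flipped.

```python
def chart_region_below_title(title_bbox, page_width, chart_height, margin=36.0):
    """Estimate the chart's PDF bounding box (x0, y0, x1, y1) from its title.

    Assumption for illustration: the chart lies directly below the title and
    spans the page width inside `margin` points on each side.
    """
    _, title_y0, _, _ = title_bbox
    bottom = max(title_y0 - chart_height, 0.0)
    return (margin, bottom, page_width - margin, title_y0)


def pdf_bbox_to_crop_box(bbox, page_height, scale=1.0):
    """Convert a PDF bbox (origin bottom-left, in points) to a pixel box
    (left, upper, right, lower) with origin top-left.

    `scale` is the number of pixels per point in the rendered page image.
    """
    x0, y0, x1, y1 = bbox
    left = int(x0 * scale)
    upper = int((page_height - y1) * scale)
    right = int(x1 * scale)
    lower = int((page_height - y0) * scale)
    return (left, upper, right, lower)


# Example: an A4 page (595 x 842 points) rendered at 2 pixels per point.
title_bbox = (72.0, 500.0, 300.0, 515.0)
region = chart_region_below_title(title_bbox, page_width=595.0, chart_height=200.0)
crop_box = pdf_bbox_to_crop_box(region, page_height=842.0, scale=2.0)
```

The resulting 4-tuple has the (left, upper, right, lower) order that PIL's `Image.crop` expects, so it could be passed to that method on a rendered page image.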
Step S4: inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report.
The industry label model can be built in advance and trained on a lexicon that includes a large number (e.g., millions) of industry words.
Illustratively, when the paragraph key feature information comprises the title and paragraph content of the industry research report, and the chart key feature information comprises the title of the industry research report and the chart title, a paragraph uses title + paragraph content as the input of the industry label model, a chart uses title + chart title as the input of the industry label model, and at least one industry label scoring result corresponding to the industry research report is output. The "tag" in the "tagged fragment data" in FIG. 2 and FIG. 3 refers to this output, namely the at least one industry label scoring result; "fragment data" refers to a chart title or paragraph content.
Optionally, outputting at least one industry label scoring result corresponding to the industry research report includes: outputting a plurality of industry label scoring results corresponding to the industry research report, wherein the industry label scoring results corresponding to different industry labels are different. Taking a financial industry research report as an example, the output result can be {'financial industry': 0.18991, 'financial regulation': 0.089101, 'economic data': 0.0011}. This shows that the financial industry research report is related to the three search keywords 'financial industry', 'financial regulation' and 'economic data', with the matching degree decreasing in that order.
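The input construction and the interpretation of the scoring result can be sketched as below. The label model itself is not shown; `build_model_input` and `rank_labels` are hypothetical helper names, and reading "title + paragraph content" as plain string concatenation is an assumption.

```python
def build_model_input(report_title, fragment_text):
    """Join the report title with a paragraph's content or a chart's title.

    Simple concatenation is an assumption; the patent only says "title + ...".
    """
    return f"{report_title} {fragment_text}"


def rank_labels(scores):
    """Return (label, score) pairs ordered from best to worst match."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Scores taken from the financial industry example in the description.
scores = {
    "financial industry": 0.18991,
    "financial regulation": 0.089101,
    "economic data": 0.0011,
}
ranked = rank_labels(scores)
best_label, best_score = ranked[0]
```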
Step S5: establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time.
The specific structure of the retrieval cluster index and the mapping field may follow the database structure.
Optionally, updating the retrieval cluster index in real time includes: consuming the mysql binary log (binlog) in real time through the middleware canal, and synchronizing the data consumed by canal to the heterogeneous retrieval cluster (ES) or a message queue (MQ) through an adapter, so as to complete the real-time update of the retrieval cluster index.
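The adapter step can be illustrated with a small translation function. This is a sketch under assumptions: the message shape (`type`, `data`) is loosely modeled on a row-change message as a binlog consumer might deliver it, the primary key column `id` and the index name are hypothetical, and the output merely mimics the action/document pairs of a bulk indexing request without calling any client library.

```python
def to_bulk_actions(message, index_name):
    """Translate one row-change message into bulk-style index actions.

    Assumed message shape (illustrative only):
        {"type": "INSERT" | "UPDATE" | "DELETE", "data": [row_dict, ...]}
    Each row is assumed to carry its primary key in the "id" column.
    """
    actions = []
    operation = message.get("type")
    for row in message.get("data") or []:
        doc_id = str(row["id"])
        if operation in ("INSERT", "UPDATE"):
            # An index action followed by the document overwrites any old copy.
            actions.append({"index": {"_index": index_name, "_id": doc_id}})
            actions.append(row)
        elif operation == "DELETE":
            actions.append({"delete": {"_index": index_name, "_id": doc_id}})
    return actions


insert_message = {
    "type": "INSERT",
    "data": [{"id": 7, "title": "Bank outlook", "fragment_type": "paragraph"}],
}
delete_message = {"type": "DELETE", "data": [{"id": 7}]}

insert_actions = to_bulk_actions(insert_message, "research_fragments")
delete_actions = to_bulk_actions(delete_message, "research_fragments")
```

Treating updates as full re-indexing keeps the adapter stateless; a partial-update strategy would need the changed columns as well.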
Step S6: selecting an industry word as the search input keyword, and setting a sorting rule at least according to the industry label scoring result.
Optionally, selecting an industry word as the search input keyword includes: adding a synonym library, and taking industry synonyms in the synonym library as search input keywords.
Optionally, setting the sorting rule at least according to the industry label scoring result includes: setting different weights for the industry label scoring result, the fragment type (including paragraph and picture) and the text respectively according to the emphasis of the search service, and setting the sorting rule according to the weights. For example, if the search service emphasizes the matching degree, the weight of the industry label scoring result in the sorting rule is the highest; if the search service emphasizes the visual display of pictures, the weight of the fragment type in the sorting rule is the highest; and if the search service emphasizes abstract introduction, the weight of the text in the sorting rule is the highest. The above is merely an example, and a person skilled in the art can set the sorting rule according to actual needs.
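Synonym-based keyword expansion can be sketched with a plain dictionary. The synonym entries and the function name `expand_keyword` are made up for illustration; a real synonym library would be far larger and would probably live in the retrieval cluster's analyzer configuration.

```python
# Hypothetical synonym library: canonical industry word -> alternatives.
SYNONYMS = {
    "new energy vehicle": ["NEV", "electric vehicle"],
    "artificial intelligence": ["AI"],
}


def expand_keyword(keyword, synonyms=SYNONYMS):
    """Return the keyword together with all of its known industry synonyms.

    Matching is case-insensitive; unknown keywords are returned unchanged.
    """
    key = keyword.strip().lower()
    for canonical, alternatives in synonyms.items():
        group = [canonical] + alternatives
        if key in (term.lower() for term in group):
            return group
    return [keyword.strip()]
```

For example, `expand_keyword("AI")` yields both "artificial intelligence" and "AI", so either spelling in a report can match the query.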
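One way to read the weighted sorting rule is as a weighted sum over the three signals. The sketch below assumes this linear form, assumes that each hit already carries a label score and a text relevance score in the range 0 to 1, and uses invented weight values and field names; the patent does not fix any of these.

```python
# Hypothetical weight presets, one per search service emphasis.
WEIGHT_PRESETS = {
    "matching": {"label": 0.6, "fragment": 0.2, "text": 0.2},
    "visual": {"label": 0.2, "fragment": 0.6, "text": 0.2},
    "summary": {"label": 0.2, "fragment": 0.2, "text": 0.6},
}


def combined_score(hit, weights):
    """Weighted sum of label score, fragment type preference and text score."""
    # Charts (pictures) get the full fragment-type score, paragraphs none.
    fragment_score = 1.0 if hit["fragment_type"] == "chart" else 0.0
    return (
        weights["label"] * hit["label_score"]
        + weights["fragment"] * fragment_score
        + weights["text"] * hit["text_score"]
    )


def sort_hits(hits, emphasis):
    """Order hits from highest to lowest combined score for an emphasis."""
    weights = WEIGHT_PRESETS[emphasis]
    return sorted(hits, key=lambda hit: combined_score(hit, weights), reverse=True)


hits = [
    {"id": "a", "fragment_type": "paragraph", "label_score": 0.9, "text_score": 0.3},
    {"id": "b", "fragment_type": "chart", "label_score": 0.4, "text_score": 0.2},
]
by_matching = [hit["id"] for hit in sort_hits(hits, "matching")]
by_visual = [hit["id"] for hit in sort_hits(hits, "visual")]
```

With these sample values the paragraph ranks first under the "matching" preset and the chart ranks first under the "visual" preset.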
Step S7: searching the retrieval cluster index by using the distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
Optionally, searching the retrieval cluster index by using the distributed service architecture includes: calling at least one search application program interface (search api) through the service registration center to complete the search of the retrieval cluster index.
The searching method of the industry research report has three advantages. First, selecting industry words as search input keywords can effectively improve the accuracy of search results. Second, structurally extracting data such as paragraphs, charts and text from the pdf-format industry research report enriches the presentation of search results and allows information fragments to be located quickly. Third, combining a distributed parsing framework, real-time retrieval cluster index updating and a distributed service architecture increases the parsing speed and provides a highly available service framework that offers basic service capability for more scenarios.
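The role of the service registration center can be illustrated with an in-memory stand-in. Everything here is hypothetical: the `ServiceRegistry` class, the round-robin selection and the toy `make_search_api` factory only show the calling pattern, not any particular registry product or the patent's actual search api.

```python
class ServiceRegistry:
    """Toy service registration center with round-robin instance selection."""

    def __init__(self):
        self._instances = {}   # service name -> list of callables
        self._next = {}        # service name -> index of the next instance

    def register(self, name, handler):
        self._instances.setdefault(name, []).append(handler)
        self._next.setdefault(name, 0)

    def call(self, name, *args, **kwargs):
        handlers = self._instances.get(name)
        if not handlers:
            raise LookupError(f"no instance registered for {name!r}")
        position = self._next[name] % len(handlers)
        self._next[name] = position + 1
        return handlers[position](*args, **kwargs)


def make_search_api(node, documents):
    """Build a toy search api that matches the keyword against fragment text."""
    def search(keyword):
        ids = [doc["id"] for doc in documents if keyword.lower() in doc["text"].lower()]
        return {"node": node, "ids": ids}
    return search


documents = [
    {"id": 1, "text": "Bank lending grew in the second quarter."},
    {"id": 2, "text": "Battery costs for electric vehicles fell."},
]
registry = ServiceRegistry()
registry.register("search", make_search_api("node-1", documents))
registry.register("search", make_search_api("node-2", documents))

first = registry.call("search", "bank")
second = registry.call("search", "battery")
```

Because callers only know the service name, search instances can be added or removed without changing the calling side.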
In addition, an embodiment of the present invention provides an apparatus for searching industry research reports. Specifically, as shown in FIG. 5, which is a block diagram of an apparatus for searching industry research reports according to an embodiment of the present invention, the apparatus includes:
the acquisition module 10 is used for acquiring an industry research report in pdf format;
the analysis module 20 is used for parsing the text and paragraphs of each pdf page in the industry research report and acquiring paragraph key feature information; locating the position coordinates of charts in the industry research report, and acquiring chart key feature information; inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report;
the search module 30 is used for establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time; selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result; and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
Illustratively, the acquisition module 10 obtains the pdf-format industry research report by purchasing it from a data company and updating it periodically, or by directional crawling of a vertical website. For example, the acquisition module 10 obtains an industry research report in pdf format through the task scheduling framework and the message queue MQ, and synchronizes the obtained report to the mysql database and/or the FTP server at regular intervals. While obtaining the pdf document of the industry research report, the acquisition module 10 can also directly obtain the title of the industry research report.
The parsing module 20 may include a first parsing unit, a second parsing unit and a third parsing unit, and the specific functions of each parsing unit are as follows:
The first parsing unit is used for parsing the text and paragraphs of each pdf page in the industry research report and acquiring paragraph key feature information. The paragraph key feature information at least comprises the title and the paragraph content of the industry research report. The title of the industry research report can be obtained by the acquisition module 10, or by the first parsing unit. The first parsing unit may further be configured to filter out redundant paragraphs through a preset filtering rule after parsing out the text and paragraphs.
The second parsing unit is used for locating the position coordinates of charts in the industry research report and acquiring chart key feature information. Specifically, the second parsing unit locates the position coordinates of a chart by locating the chart title, then captures the chart as a screenshot through PIL (Python Image Library) and obtains the chart key feature information. The chart key feature information at least comprises the title of the industry research report and the chart title.
The third parsing unit is used for inputting the paragraph key feature information and the chart key feature information into the industry label model and outputting at least one industry label scoring result corresponding to the industry research report. When the paragraph key feature information comprises the title and paragraph content of the industry research report, and the chart key feature information comprises the title of the industry research report and the chart title, the third parsing unit inputs title + paragraph content and title + chart title into the industry label model and outputs at least one industry label scoring result corresponding to the industry research report. Optionally, the third parsing unit outputs a plurality of industry label scoring results corresponding to the industry research report, where the industry label scoring results corresponding to different industry labels are different.
The searching module 30 may include a first searching unit, a second searching unit and a third searching unit, and the specific functions of each searching unit are as follows:
The first searching unit is used for establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time. Specifically, the first searching unit consumes the mysql binlog in real time through canal, and completes the real-time update of the retrieval cluster index by synchronizing the data consumed by canal to the heterogeneous retrieval cluster ES or the message queue MQ through an adapter.
The second searching unit is used for selecting an industry word as the search input keyword and setting a sorting rule at least according to the industry label scoring result. The second searching unit selects the search input keyword as follows: adding a synonym library, and taking industry synonyms in the synonym library as search input keywords. The second searching unit sets the sorting rule as follows: setting different weights for the industry label scoring result, the fragment type (including paragraph and picture) and the text respectively according to the emphasis of the search service, and setting the sorting rule according to the weights.
The third searching unit is used for searching the retrieval cluster index by using the distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule. The third searching unit searches the retrieval cluster index as follows: calling at least one search application program interface (search api) through the service registration center to complete the search of the retrieval cluster index.
FIG. 6 is an overall architecture diagram of a service platform including a search apparatus according to an embodiment of the present invention. As shown in FIG. 6, the acquisition module of the search apparatus is located at the first layer of the service platform and is used for acquiring industry data sources. The analysis module of the search apparatus is located at the second layer and provides an industry information parsing framework for the entire service platform; the second layer is connected to the first layer through an FTP server. The search module of the search apparatus is located at the third layer and serves as the search platform; the third layer is connected to the second layer through a mysql database. The service platform further includes a fourth layer, a SaaS platform, which is connected to the third layer through a gateway.
In addition, an embodiment of the present invention further provides an electronic device. The electronic device includes a processor and a memory coupled to the processor, where the memory stores a computer program executable by the processor, and the processor, when executing the computer program, implements the method for searching an industry research report according to any one of the above embodiments.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for searching an industry research report, comprising:
acquiring an industry research report in a pdf format;
analyzing the text and paragraph of each pdf page in the industry research report, and acquiring key feature information of the paragraph;
positioning the position coordinates of the chart in the industry research report, and acquiring key characteristic information of the chart;
inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report;
establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time;
selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result;
and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
2. The method for searching an industry research report of claim 1, wherein analyzing the text and paragraph of each pdf page in the industry research report comprises: analyzing the text and paragraphs of each pdf page in the industry research report through a pdf parsing tool, and filtering out redundant paragraphs through a preset filtering rule.
3. The method for searching an industry research report according to claim 1, wherein the paragraph key feature information at least comprises a title of the industry research report and paragraph content; and the chart key feature information at least comprises the title of the industry research report and a chart title.
4. The method for searching an industry research report according to claim 1, wherein said outputting at least one industry label scoring result corresponding to the industry research report comprises: outputting a plurality of industry label scoring results corresponding to the industry research report, wherein the industry label scoring results corresponding to different industry labels are different.
5. The method for searching an industry research report of claim 1, wherein said updating the retrieval cluster index in real time comprises: consuming the MySQL binlog in real time through Canal, and replicating the data obtained by Canal consumption to the retrieval cluster or a message queue through an adapter, so as to complete the real-time updating of the retrieval cluster index.
6. The method for searching an industry research report of claim 1, wherein selecting an industry word as a search input keyword comprises: adding a synonym library, and taking the industry synonyms in the synonym library as search input keywords.
7. The method for searching an industry research report of claim 1, wherein the setting of a sorting rule at least according to the industry label scoring result comprises: setting different weights for the industry label scoring result, the fragment type and the text, respectively, according to the emphasis of the search service, and setting the sorting rule according to the weights.
8. The method for searching an industry research report of claim 1, wherein the searching the retrieval cluster index by using a distributed service architecture comprises: calling at least one search application program interface through a service registration center to complete the search of the retrieval cluster index.
9. An apparatus for searching an industry research report, comprising:
the acquisition module is used for acquiring an industry research report in a pdf format;
the analysis module is used for analyzing the text and the paragraph of each pdf page in the industry research report and acquiring key feature information of the paragraph; positioning the position coordinates of the chart in the industry research report, and acquiring key feature information of the chart; inputting the paragraph key feature information and the chart key feature information into an industry label model, and outputting at least one industry label scoring result corresponding to the industry research report;
the search module is used for establishing a retrieval cluster index and a mapping field according to the text, the paragraph key feature information and the chart key feature information, and updating and pushing the retrieval cluster index in real time; selecting an industry word as a search input keyword, and setting a sorting rule at least according to the industry label scoring result; and searching the retrieval cluster index by using a distributed service architecture, determining a search result through the mapping field, and displaying the search result according to the sorting rule.
10. An electronic device comprising a processor, and a memory coupled to the processor, the memory storing a computer program executable by the processor, the processor implementing the method of searching for an industry research report of any one of claims 1-8 when executing the computer program.
CN202110638917.4A 2021-06-08 2021-06-08 Search method and device for industry research report and electronic equipment Active CN113535892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110638917.4A CN113535892B (en) 2021-06-08 2021-06-08 Search method and device for industry research report and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110638917.4A CN113535892B (en) 2021-06-08 2021-06-08 Search method and device for industry research report and electronic equipment

Publications (2)

Publication Number Publication Date
CN113535892A true CN113535892A (en) 2021-10-22
CN113535892B CN113535892B (en) 2023-12-01

Family

ID=78124697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110638917.4A Active CN113535892B (en) 2021-06-08 2021-06-08 Search method and device for industry research report and electronic equipment

Country Status (1)

Country Link
CN (1) CN113535892B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120191716A1 (en) * 2002-06-24 2012-07-26 Nosa Omoigui System and method for knowledge retrieval, management, delivery and presentation
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document
CN111177532A (en) * 2019-12-02 2020-05-19 平安资产管理有限责任公司 Vertical search method, device, computer system and readable storage medium
CN111666383A (en) * 2020-06-30 2020-09-15 腾讯科技(深圳)有限公司 Information processing method, information processing device, electronic equipment and computer readable storage medium
CN111680180A (en) * 2020-05-26 2020-09-18 广州多益网络股份有限公司 Text boxed display method and device for chart search
CN112035723A (en) * 2020-08-28 2020-12-04 光大科技有限公司 Resource library determination method and device, storage medium and electronic device
CN112149387A (en) * 2020-09-28 2020-12-29 深圳壹账通智能科技有限公司 Visualization method and device for financial data, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EWELINA LACKA ET AL.: "Technological advancements and B2B international trade: A bibliometric analysis and review of industrial marketing research", 《INDUSTRIAL MARKETING MANAGEMENT》, pages 1 - 11 *
吴淙: "中文文本校对关键技术研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》, pages 138 - 2659 *
王俭: "基于SOR理论的消费者感知在线评论有用性形成机理与评价研究", 《中国博士学位论文全文数据库 经济与管理科学辑》, pages 152 - 76 *

Also Published As

Publication number Publication date
CN113535892B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
Deepak et al. A novel firefly driven scheme for resume parsing and matching based on entity linking paradigm
CN108932294B (en) Resume data processing method, device, equipment and storage medium based on index
CN110019616B (en) POI (Point of interest) situation acquisition method and equipment, storage medium and server thereof
CN106844640B (en) Webpage data analysis processing method
KR20200007969A (en) Information processing methods, terminals, and computer storage media
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
CN108959236B (en) Medical literature classification model training method, medical literature classification method and device thereof
US10261992B1 (en) System and method for actionizing patient comments
CN110795932B (en) Geological report text information extraction method based on geological ontology
CN111428503B (en) Identification processing method and processing device for homonymous characters
Das et al. A CV parser model using entity extraction process and big data tools
CN110929119A (en) Data annotation method, device, equipment and computer storage medium
CN111626568A (en) Knowledge base construction method and device and knowledge search method and system
CN114595686A (en) Knowledge extraction method, and training method and device of knowledge extraction model
CN112328806A (en) Data processing method, system, computer equipment and storage medium
Maciołek et al. Cluo: Web-scale text mining system for open source intelligence purposes
CN113742496B (en) Electric power knowledge learning system and method based on heterogeneous resource fusion
US20170235835A1 (en) Information identification and extraction
CN114996549A (en) Intelligent tracking method and system based on active object information mining
CN113706253A (en) Real-time product recommendation method and device, electronic equipment and readable storage medium
CN108875014B (en) Precise project recommendation method based on big data and artificial intelligence and robot system
CN113434627A (en) Work order processing method and device and computer readable storage medium
CN113535892B (en) Search method and device for industry research report and electronic equipment
JP6763967B2 (en) Data conversion device and data conversion method
CN114925125A (en) Data processing method, device and system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant