WO2016111007A1 - Data analysis system, data analysis system control method, and data analysis system control program - Google Patents

Data analysis system, data analysis system control method, and data analysis system control program

Publication number
WO2016111007A1
WO2016111007A1 PCT/JP2015/050517 JP2015050517W WO2016111007A1 WO 2016111007 A1 WO2016111007 A1 WO 2016111007A1 JP 2015050517 W JP2015050517 W JP 2015050517W WO 2016111007 A1 WO2016111007 A1 WO 2016111007A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
unit
predetermined
evaluation
analysis system
Prior art date
Application number
PCT/JP2015/050517
Other languages
French (fr)
Japanese (ja)
Inventor
守本 正宏
秀樹 武田
和巳 蓮子
Original Assignee
株式会社Ubic
Priority date
Filing date
Publication date
Application filed by 株式会社Ubic
Priority to PCT/JP2015/050517
Publication of WO2016111007A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to a data analysis system, a data analysis system control method, and a data analysis system control program.
  • Conventionally, a document classification system has been proposed that includes: an extraction unit that extracts a predetermined number of documents from collected document information; a document display unit that displays the extracted document group on a screen; a classification code receiving unit that receives a classification code that a user assigns to the displayed document group based on its relevance to a lawsuit; a selection unit that classifies the extracted document group by classification code and analyzes and selects keywords that appear in common in the classified document group; a database that records the selected keywords; a search unit that searches the document information for the keywords recorded in the database; a score calculation unit that uses the search result of the search unit and the analysis result of the selection unit to calculate a score indicating the relevance between a classification code and a document; and an automatic classification unit that automatically assigns a classification code based on the score (for example, see Patent Document 1).
  • According to the document classification system described in Patent Document 1, digitized document information collected for submission as evidence in a lawsuit can be analyzed and classified for easier use.
  • In a document classification system such as the one described in Patent Document 1, documents may be randomly sampled from the document information to be classified, and a reviewer may manually extract the document group to which classification codes are to be assigned.
  • When a reviewer extracts data manually, the data extraction accuracy cannot be improved automatically. It is therefore desirable to separate the digital information automatically, so that learning accuracy does not deteriorate and the loss of automatic processing accuracy in the separation process is reduced.
  • an object of the present invention is to provide a data analysis system, a data analysis method, and a data analysis program that can reduce deterioration in automatic processing accuracy of the separation process.
  • A data analysis system according to the present invention evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. It includes: a data evaluation unit that evaluates the relevance based on the extracted pattern; a reference determination unit that determines whether the evaluation result of the data evaluation unit satisfies a predetermined criterion; a data selection unit that selects a predetermined number of data from the unknown data whose relevance has been evaluated by the data evaluation unit; a data re-evaluation unit that re-evaluates the relevance between the data selected by the data selection unit and the predetermined case based on the patterns included in the selected data; and a reference change unit that changes the predetermined criterion used by the reference determination unit based on the re-evaluation result of the data re-evaluation unit.
  • The data analysis system can be implemented in at least the following three configurations: (a) part or all of a data analysis program that realizes the data analysis system is executed on a client device (for example, a user terminal such as a personal computer or a smartphone); (b) part or all of the data analysis program is executed on a server device (for example, a mainframe, a cluster computer, or any computer capable of providing the system's services to an external device), and the result is returned to the client device; or (c) the processing included in the data analysis program is divided arbitrarily between the client device and the server device. In other words, it is sufficient that the data analysis system is realized as a system composed of at least one computer, and the functions of the data analysis system may be shared arbitrarily among the computers that make up the system.
  • The data analysis system may further include a parameter determination unit that determines, based on the re-evaluation result, a parameter of a model assumed for the distribution of the re-evaluation results. The reference change unit may then, for example, change the predetermined criterion using the model identified by the determined parameter.
  • the model may be a function belonging to an exponential distribution family, for example, and the parameter determination unit may determine an exponent parameter of the function.
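As a minimal sketch of this idea, the code below assumes the simplest member of the exponential family, a plain exponential distribution, and fits its single rate parameter to re-evaluated scores by maximum likelihood. The function names, the sample scores, and the rule "put the threshold where the model leaves 10% of the mass above it" are all assumptions for illustration; the source does not specify the model or the rule.

```python
import math

def fit_exponential_rate(scores):
    """Maximum-likelihood rate of an exponential distribution (1 / mean).

    Hypothetical stand-in for the parameter determination unit.
    """
    if not scores or min(scores) < 0:
        raise ValueError("scores must be non-empty and non-negative")
    mean = sum(scores) / len(scores)
    if mean == 0:
        raise ValueError("scores must not all be zero")
    return 1.0 / mean

def threshold_for_tail(rate, tail_fraction):
    """Score above which the fitted model places `tail_fraction` of its mass.

    For an exponential model P(X > t) = exp(-rate * t),
    so t = -ln(tail_fraction) / rate.
    """
    return -math.log(tail_fraction) / rate

# Hypothetical re-evaluation scores for a resampled batch.
reevaluated_scores = [1.0, 2.0, 3.0]
rate = fit_exponential_rate(reevaluated_scores)
new_threshold = threshold_for_tail(rate, tail_fraction=0.1)
```

Fitting a model rather than reading the threshold straight off the sample smooths out noise when the resampled batch is small, at the cost of assuming the scores really follow the chosen distribution.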
  • the data re-evaluation unit evaluates the relationship by, for example, calculating a score indicating the strength of the connection between the selected data and the predetermined case.
  • The predetermined criterion is, for example, whether a threshold set for the score is exceeded. The reference change unit may change the predetermined criterion by replacing that threshold with another threshold that can be uniquely specified from a rank determined by ordering according to a predetermined rule.
  • The data analysis system may further include a generation unit that generates, for each partial data included in the data, appearance information indicating whether a predetermined data element appears in that partial data, and a relationship reflecting unit that obtains relationship information for each partial data by reflecting, in the appearance information generated by the generation unit, the relationship between the predetermined data element and another data element different from it.
  • The data evaluation unit and/or the data re-evaluation unit may evaluate the relevance based on, for example, the relationship information obtained by the relationship reflecting unit.
  • The data may include, for example, at least a user's evaluation of an event. The data analysis system may further include an emotion extraction unit that extracts from the data the emotion of the user who generated the data, that is, the emotion toward the event that arises from the evaluation. The data evaluation unit and/or the data re-evaluation unit may evaluate the relevance based on the extraction result of the emotion extraction unit.
  • The data analysis system may further include a concept extraction unit that, by referring to a database that stores a predetermined pattern in association with a concept included in that pattern, extracts from the partial data a high-level concept capable of summarizing the partial data that includes the predetermined pattern.
  • The database need only be a readable and writable storage medium connected to the computer that implements the data analysis system. It broadly includes, for example, a hard disk connected to the computer, a file server communicably connected to the computer via a predetermined network (for example, the Internet or an intranet), and a memory provided in another computer communicably connected to the computer via the predetermined network.
  • The data analysis system may further include a data extraction unit that extracts a predetermined number of data as training data from a data set stored in a database, a classification information receiving unit that receives classification information indicating the classification of the training data from the user via an input device, and a data classification unit that classifies the training data by associating the received classification information with the training data. A pattern may be extracted from the training data based on the classification result.
  • For example, (a) a client device that is communicably connected via a predetermined network to the computer that implements the data analysis system may include a predetermined input device (for example, a keyboard or a mouse); the client device transmits the classification information entered through the input device to the computer, and the classification information receiving unit realized on the computer acquires it. Alternatively, (b) the computer itself may include the predetermined input device, and the classification information receiving unit may acquire the classification information from it.
  • The data evaluation unit and/or the data re-evaluation unit may, for example, evaluate the relevance for each phase, where a phase is an index indicating a stage in the progress of a predetermined action.
  • The reference determination unit may determine, for each phase, whether the predetermined criterion provided for that phase is satisfied, and the reference change unit may change each of the predetermined criteria provided for the phases.
  • the data analysis system includes a data element extraction unit that extracts data elements from the data classified by the data classification unit, and an element evaluation unit that evaluates the data elements according to a predetermined element criterion.
  • the data evaluation unit and / or the data reevaluation unit may evaluate the relevance by using, for example, the data element evaluated by the element evaluation unit.
  • The element evaluation unit may, for example, evaluate a data element using, as one of the predetermined element criteria, a transmitted information amount that represents the dependency between the data element and the classification information associated with the data that includes the data element.
  • The data analysis system may further include a data element storage unit that stores, in a database, each data element extracted by the data element extraction unit in association with the result of its evaluation by the element evaluation unit.
  • The data evaluation unit and/or the data re-evaluation unit may evaluate the relevance based on the correlation between a first data element and a second data element included in the data.
  • The data may include, for example, text in at least part of it. The data evaluation unit and/or the data re-evaluation unit may evaluate the relevance between a sentence or paragraph included in the text and the predetermined case, and may evaluate the relevance between the data and the predetermined case based on that evaluation result.
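  • The data analysis system control method according to the present invention evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. It includes: a data evaluation step of evaluating the relevance based on the extracted pattern; a reference determination step of determining whether the evaluation result of the data evaluation step satisfies a predetermined criterion; a data selection step of selecting a predetermined number of data from the unknown data whose relevance was evaluated in the data evaluation step; a data re-evaluation step of re-evaluating the relevance between the selected data and the predetermined case based on the patterns contained in the selected data; and a reference changing step of changing the predetermined criterion based on the re-evaluation result of the data re-evaluation step.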
  • The data analysis system control program according to the present invention evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. It causes a computer to realize: a data evaluation function that evaluates the relevance based on the extracted pattern; a reference determination function that determines whether the evaluation result of the data evaluation function satisfies a predetermined criterion; a data selection function that selects a predetermined number of data from the unknown data whose relevance was evaluated by the data evaluation function; a data re-evaluation function that re-evaluates the relevance between the data selected by the data selection function and the predetermined case based on the patterns contained in the selected data; and a reference changing function that changes the predetermined criterion based on the re-evaluation result of the data re-evaluation function.
  • The data analysis system, the data analysis method, and the data analysis program according to the present invention can reduce deterioration of the automatic processing accuracy of the separation process.
  • The data analysis system 1 is a system that evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. More specifically, for data stored in an information processing apparatus such as a user terminal or a server (for example, e-mails or documents), the data analysis system 1 evaluates the relevance between each data item and a predetermined case, and judges whether the data is related to the predetermined case according to whether the evaluation result satisfies a predetermined criterion.
  • the predetermined case is information indicating that it relates to, for example, a lawsuit, a cartel, an illegal act, and / or a stage of progress thereof.
  • The training data and/or unknown data (hereinafter sometimes referred to as “data”) is, for example, data that includes text in at least part of it, such as text data (e-mail, presentation material, spreadsheet material, meeting material, contract, organization chart, business plan document, and the like). The pattern may be a keyword (for example, a morpheme) included in the text data (for example, in a sentence or paragraph).
  • The data analysis system 1 can be applied, for example, to a forensic system (or to a discovery support system that supports discovery procedures in US civil lawsuits). Forensics here refers to the technology that, when a crime or legal dispute involving computers occurs, such as unauthorized access or leakage of confidential information, collects and analyzes the equipment, data, and electronic records needed to investigate its cause and establishes their value as legal evidence.
  • the data analysis system 1 first extracts a data group including a predetermined number of data as training data from a database storing e-mails and the like.
  • the data analysis system 1 associates classification information indicating a predetermined data classification (whether or not it is related to a predetermined case) with all or a part of the data included in the extracted data group.
  • the data analysis system 1 evaluates the relevance between the data (mainly unknown data with which classification information is not yet associated) and the classification information, that is, the relevance between the unknown data and a predetermined case.
  • the data analysis system 1 scores each piece of data using a keyword in which weighting information corresponding to the degree of relevance with classification information is associated in advance.
  • For example, if the data analysis system 1 is realized as a “forensic system”, the degree of relevance is the degree to which a document (data) is related to the lawsuit. The data analysis system 1 associates a higher score with data the more strongly the data is related to the classification information (predetermined case).
  • the data analysis system 1 may previously include a keyword database that stores a plurality of keywords related to the classification information and weighting information associated with each of the plurality of keywords.
  • The weighting information assigns a larger weight the more relevant the associated keyword is to the classification information.
  • the data analysis system 1 scores each of a plurality of data using this weighting information.
  • The data analysis system 1 determines whether the relevance evaluation result satisfies a predetermined criterion (for example, whether a value calculated by an exponential function that takes the score associated with the data as its variable, hereinafter sometimes called the “rank”, exceeds a predetermined rank). When the criterion is satisfied, it judges that the extracted data is related to the classification indicated by the classification information. The data analysis system 1 then re-extracts a predetermined number of data from the data whose relevance has been evaluated, re-evaluates the relevance between the re-extracted data and the classification information, and changes the predetermined criterion according to the re-evaluation result. By repeating data extraction, evaluation of the relevance between the data and the classification information, and the determination of whether the evaluation result satisfies the predetermined criterion, the data analysis system 1 can improve extraction accuracy.
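The criterion check and its adjustment can be sketched as follows. This is only an illustration under stated assumptions: the source says the rank comes from an exponential function of the score but gives no formula, so `rank` below uses a hypothetical mapping `1 - exp(-scale * score)`, and `changed_threshold` uses a hypothetical rule that keeps roughly the top `keep_count` re-evaluated items.

```python
import math

def rank(score, scale=1.0):
    # Hypothetical exponential mapping of a non-negative score into [0, 1).
    return 1.0 - math.exp(-scale * score)

def satisfies_criterion(score, rank_threshold):
    # The criterion in the text: does the rank exceed a predetermined rank?
    return rank(score) > rank_threshold

def changed_threshold(reevaluated_scores, keep_count):
    """Pick a new rank threshold so that the top `keep_count` items pass.

    Assumed rule, not taken from the source. Ties at the cut-off rank
    may cause fewer than `keep_count` items to pass.
    """
    ranks = sorted((rank(s) for s in reevaluated_scores), reverse=True)
    if not 0 < keep_count < len(ranks):
        raise ValueError("keep_count must be between 1 and len(scores) - 1")
    return ranks[keep_count]

# Hypothetical scores for a re-extracted batch of data.
scores = [0.5, 1.0, 2.0, 3.0, 4.0]

threshold = 0.9
before = [s for s in scores if satisfies_criterion(s, threshold)]

new_threshold = changed_threshold(scores, keep_count=3)
after = [s for s in scores if satisfies_criterion(s, new_threshold)]
```

In this toy run the initial threshold admits the two highest-scoring items, and the changed threshold admits three, which shows how re-evaluation results can loosen or tighten the criterion between iterations.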
  • There may be one server or a plurality of servers.
  • the server includes a server capable of storing digital information such as a mail server, a file server, or a document management server.
  • There may be one user terminal or a plurality of user terminals.
  • the user terminal includes a personal computer, a notebook computer, a tablet PC, or a mobile communication terminal such as a mobile phone.
  • FIG. 1 shows an example of functional configuration blocks of the data analysis system according to the present embodiment.
  • the data analysis system 1 includes a database 100 that stores data, a data extraction unit 102 that extracts predetermined data from the database 100, an input device 104 for entering information such as classification information, a classification information receiving unit 106 that receives classification information, a data classification unit 108 that associates classification information with data, a data element extraction unit 110 that extracts data elements from data, an element evaluation unit 112 that evaluates data elements, and a data evaluation unit 114 that evaluates the relationship between data and classification information.
  • The data analysis system 1 may also include the following units, none of which is shown in FIG. 1: a generation unit that generates, for each partial data included in the data, appearance information indicating whether a predetermined data element appears in that partial data; a relationship reflecting unit that obtains relationship information for each partial data by reflecting, in the appearance information generated by the generation unit, the relationship between the predetermined data element and another data element different from it; an emotion extraction unit that extracts from the data the emotion of the user who generated the data, that is, the emotion toward an event that arises from the user's evaluation of that event; and a concept extraction unit that, by referring to a database that stores a predetermined pattern in association with a concept included in that pattern, extracts from the partial data a high-level concept capable of summarizing the partial data that includes the predetermined pattern.
  • the database 100 stores a data set having a data group including a predetermined number of data.
  • the database 100 stores data elements that are included in the data and that are associated with the evaluation results in the element evaluation unit 112.
  • the data includes, for example, text data, image data, and / or audio data.
  • the text data may include data indicating a sentence, a plurality of morphemes, or a plurality of co-occurrence expressions.
  • the text data is e-mail, presentation material, spreadsheet material, meeting material, contract, organization chart, business plan, and the like.
  • The data group is, for example, a set of multiple data items, or of a predetermined number of data items, that form a predetermined group.
  • the data set includes a plurality of data groups.
  • the data element is, for example, a word, and the word is a minimum language unit having a specific meaning and function in the grammar.
  • A score calculated for data (that is, the evaluation result; if the data is a document file, the score corresponds to a “document score”) is, for example, a numerical value indicating the level of relevance between the data and a predetermined case. The larger the value, the higher the relevance.
  • The data analysis system 1 may include a combination storage unit that stores, as a dictionary, combinations of multiple words related to predetermined classification information, each in association with a score indicating the level of relevance to that classification information.
  • The data analysis system 1 analyzes the sentences in a selected file by morphological analysis and determines whether a combination of words stored in the combination storage unit is included in the selected file.
  • When the data analysis system 1 determines that a combination of words stored in the combination storage unit is included in the selected file, it determines the level of relevance of the file to the predetermined classification information based on the distance between the words, the word order of the words, and/or whether the words are included in the same sentence. The data analysis system 1 then associates information indicating the determination result (that is, information indicating the level of relevance to the predetermined classification information) with the selected file.
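A rough sketch of this combination check is shown below. It replaces morphological analysis with a simple regular-expression tokenizer, ignores word order, and uses an invented grading rule (full score when the words share a sentence within `max_gap` tokens, half score when they only share the file, zero otherwise). The function names, the rule, and `max_gap` are assumptions for illustration; the source does not state how distance and sentence co-occurrence are combined.

```python
import re

def tokenize(sentence):
    # Crude stand-in for morphological analysis: lowercase word tokens.
    return re.findall(r"[a-z']+", sentence.lower())

def combination_relevance(text, words, base_score, max_gap=5):
    """Grade how strongly a stored word combination appears in `text`.

    Assumed rule: `base_score` if all words occur in one sentence within
    `max_gap` tokens of each other, half of it if they only co-occur in
    the file, 0.0 if any word is missing.
    """
    sentences = [tokenize(s) for s in re.split(r"[.!?]+", text)]
    for tokens in sentences:
        if all(w in tokens for w in words):
            positions = [tokens.index(w) for w in words]
            if max(positions) - min(positions) <= max_gap:
                return base_score
    all_tokens = {t for tokens in sentences for t in tokens}
    if all(w in all_tokens for w in words):
        return base_score / 2
    return 0.0

combo = ("price", "competitor")  # hypothetical dictionary entry
same_sentence = combination_relevance(
    "We agreed on the price with the competitor. Lunch was good.", combo, 2.0)
same_file = combination_relevance(
    "The price rose. The competitor left.", combo, 2.0)
absent = combination_relevance("Lunch was good.", combo, 2.0)
```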
  • The relationship is, for example, the closeness of the connection between data and classification information.
  • The closeness increases as the number of matching keywords increases.
  • the database 100 supplies the stored data to the data extraction unit 102 in response to an action from the data extraction unit 102.
  • the database 100 can also be provided outside the data analysis system 1.
  • the database 100 and other components of the data analysis system 1 excluding the database 100 may be connected to each other via a communication network such as the Internet or a wired or wireless network such as a LAN.
  • the data extraction unit 102 extracts a data group included in the data set stored in the database 100 as the target to be classified by the user (training data).
  • the data extraction unit 102 supplies the extracted data group to the data classification unit 108.
  • the input device 104 receives input of classification information as predetermined information from a user. Then, the classification information reception unit 106 receives classification information indicating the classification of data from the user via the input device 104.
  • the input device 104 is, for example, a keyboard, a mouse, and / or a touch panel.
  • the classification information is information for identifying classification targets such as lawsuits, cartels, and / or predetermined stages of their progress, for example.
  • the classification information is associated with an identifier that uniquely identifies the classification target.
  • the classification information is a tag such as “Responsive” indicating that the document is related to the lawsuit, “Hot” indicating that the document is particularly related, and “Non-Responsive” indicating that the document is irrelevant.
  • As another example, the classification information is a tag such as “Responsive”, indicating that a certain user is attempting to leak confidential information of the organization, or “Non-Responsive”, indicating that the user is unrelated to the leakage.
  • the data classification unit 108 classifies the data by associating the classification information received by the classification information receiving unit 106 with the predetermined number of data included in the data group, either for each data item or for each predetermined number of data.
  • the data classification unit 108 supplies data associated with the classification information to the data element extraction unit 110.
  • the data element extraction unit 110 extracts data elements (for example, morphemes and words) from the data classified by the data classification unit 108.
  • the data element extraction unit 110 supplies the extracted data elements to the element evaluation unit 112.
  • the element evaluation unit 112 evaluates the data element extracted by the data element extraction unit 110 according to a predetermined element criterion. Specifically, the element evaluation unit 112 evaluates the data element according to a measure of mutual dependence between the data element and the classification information associated with the data that includes the data element. For example, the element evaluation unit 112 calculates a transmitted information amount representing the dependency between the data element and that classification information, and uses the calculated amount as one of the predetermined element criteria. (The transmitted information amount is calculated from a predetermined definition using a random variable that represents the appearance probability of a predetermined word and a random variable that represents the appearance probability of predetermined classification information.) As an example, the larger the calculated transmitted information amount, the more strongly the element evaluation unit 112 rates the data element as a word that characterizes the predetermined classification information.
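The description above (a dependence measure built from a word-appearance variable and a classification variable) reads like mutual information, so the sketch below uses that as an assumption; the source does not give its exact definition. The function name, the toy documents, and the tags are hypothetical.

```python
import math

def mutual_information(docs, labels, word):
    """Mutual information (in bits) between 'word appears in the data'
    and the classification tag, estimated from co-occurrence counts."""
    n = len(docs)
    counts = {}
    for tokens, label in zip(docs, labels):
        key = (word in tokens, label)
        counts[key] = counts.get(key, 0) + 1
    mi = 0.0
    for (present, label), c in counts.items():
        p_xy = c / n
        p_x = sum(v for (p, _), v in counts.items() if p == present) / n
        p_y = sum(v for (_, l), v in counts.items() if l == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Toy training data: each document is a set of words plus a tag.
docs = [{"price", "cartel"}, {"price"}, {"lunch"}, {"lunch", "menu"}]
labels = ["Responsive", "Responsive", "Non-Responsive", "Non-Responsive"]

mi_price = mutual_information(docs, labels, "price")
mi_menu = mutual_information(docs, labels, "menu")
```

Here "price" appears in exactly the Responsive documents, so it carries the full one bit of information about the tag, while "menu" appears in only one document and scores lower. A word with a higher value is a better candidate keyword for that classification.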
  • the data evaluation unit 114 evaluates the relationship between the data (unknown data) included in the data set and the classification information based on the classification result by the data classification unit 108.
  • The data evaluation unit 114 may determine the level of relevance of the data to predetermined classification information based on, for example, the distance between the data and the data or keywords associated with the classification information in advance, the word order, the number and frequency of appearances of data associated with the classification information in advance, and/or whether several such data are included in the same sentence. The data evaluation unit 114 then evaluates the data based on information indicating the determination result (that is, information indicating the level of relevance to the predetermined classification information).
  • the data evaluation unit 114 generates appearance information indicating whether or not a predetermined data element appears in the partial data included in the data. More specifically, the data evaluation unit 114 generates, for example, a keyword vector (appearance information) indicating whether or not a predetermined keyword (morpheme) is included in a sentence or paragraph included in the data (document).
  • The keyword vector is a vector in which each element takes the value “0” or “1” to indicate whether the predetermined keyword associated with that element is included in the data (a so-called “bag of words”). For example, when the keyword “price” is included in the data, the data evaluation unit 114 changes the element of the keyword vector corresponding to “price” from “0” to “1”. The data evaluation unit 114 then calculates the score S of each data item as the inner product of the generated keyword vector and the evaluation values (weights) of the keywords:
  • [Equation 1] S = w^T · s
  • where s represents the keyword vector, w represents the weight vector, and T means transposition.
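The inner-product score can be illustrated with a few lines of code. The keyword list, the weights, the sample sentence, and the whitespace tokenizer are hypothetical; a real system would take keywords and weights from the keyword database and use morphological analysis.

```python
def keyword_vector(text, keywords):
    # 1 if the keyword occurs in the text, else 0 (bag-of-words style).
    tokens = set(text.lower().split())
    return [1 if kw in tokens else 0 for kw in keywords]

def score(vector, weights):
    # Inner product of the keyword vector s and the weight vector w.
    return sum(s_i * w_i for s_i, w_i in zip(vector, weights))

keywords = ["price", "competitor", "meeting"]   # hypothetical keywords
weights = [2.0, 1.5, 0.5]                       # hypothetical weights

doc = "Please confirm the price before the meeting"
s = keyword_vector(doc, keywords)
S = score(s, weights)
```

Because the vector only records presence, a keyword that appears many times counts once; a frequency-based vector would be a different design choice.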
  • The data evaluation unit 114 may evaluate the relevance for each phase (for example, a relationship building phase, a preparation phase, and/or an execution phase), where a phase is an index indicating a stage in the progress of a predetermined action (for example, a criminal action).
  • the phase is an index indicating each stage where the predetermined action progresses (classified according to the progress of the predetermined action).
  • The “Relationship Building” phase is a precondition for the “Competition” phase and is the stage of building a relationship with customers and competitors.
  • the “Preparation” phase refers to a stage in which information regarding competition is exchanged with competitors (which may be third parties).
  • the phase of “Competition” refers to the stage of presenting a price to a customer, obtaining feedback, and communicating with the competitor regarding the feedback.
  • In these phases, predetermined actions that can lead to a lawsuit or fraud investigation often occur, such as “inquiry from a customer” and “obtaining a competitor's production status”.
  • The typical actions that can lead to a lawsuit or fraud investigation are known for each of the above phases.
  • the data evaluation unit 114 uses a keyword in which a weight is associated with each phase in advance, and evaluates the relevance between the data and the classification information for each phase.
  • the data evaluation unit 114 calculates a score of data to be evaluated for each phase by associating a calculated value obtained by adding the weights of keywords included in the phase with the phase. Then, the data evaluation unit 114 evaluates whether each phase has occurred by comparing a score with a predetermined threshold value indicating whether each phase has occurred.
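A minimal sketch of per-phase scoring follows. The phase names come from the text, but the keywords, weights, thresholds, and the "strictly greater than" comparison are assumptions chosen only to make the example concrete.

```python
# Hypothetical per-phase keyword weights and thresholds.
PHASE_WEIGHTS = {
    "relationship_building": {"lunch": 1.0, "introduce": 1.5},
    "preparation": {"competitor": 2.0, "production": 1.0},
    "competition": {"price": 2.0, "feedback": 1.0},
}
PHASE_THRESHOLDS = {
    "relationship_building": 1.0,
    "preparation": 2.0,
    "competition": 2.5,
}

def phase_scores(text):
    # Sum the weights of each phase's keywords that appear in the text.
    tokens = set(text.lower().split())
    return {
        phase: sum(w for kw, w in kws.items() if kw in tokens)
        for phase, kws in PHASE_WEIGHTS.items()
    }

def phases_occurred(scores):
    # A phase is judged to have occurred when its score exceeds its threshold.
    return {p: scores[p] > PHASE_THRESHOLDS[p] for p in scores}

text = "the competitor shared production status and asked about price"
scores = phase_scores(text)
occurred = phases_occurred(scores)
```

Keeping a separate threshold per phase lets each one be changed independently when re-evaluation results come in.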
  • the data evaluation unit 114 can also evaluate the relevance by using the data elements evaluated by the element evaluation unit 112. Furthermore, the data evaluation unit 114 can also evaluate the relationship based on the correlation between the first data element and the second data element included in the data (that is, the evaluation process is a co-occurrence process (that is, one This is a process that takes into account the correlation between other words and other words)). For example, the data evaluation unit 114 extracts a first data element and a second data element different from the first data element from a plurality of data elements (for example, morphemes and words) of text data included in the data. Note that the data evaluation unit 114 and the data reevaluation unit 116 have substantially the same function, and both calculate scores.
  • the data evaluation unit 114 also evaluates the relevance by calculating a score. The data evaluation unit 114 evaluates the level of relevance according to the positions of the first data element and the second data element in the text data, the distance between the two data elements, and/or the degree of coincidence or approximation (correlation) between the keyword previously associated with the classification information and the first and second data elements.
  • the data evaluation unit 114 can also evaluate the relevance between a sentence included in the text indicated by the text data in the data and the classification information, and evaluate the relevance between the data and the classification information based on that evaluation result. For example, the data evaluation unit 114 evaluates the relevance according to the degree of coincidence or approximation between the words included in the sentence and the keyword previously associated with the classification information. The data evaluation unit 114 supplies information indicating the evaluation result to the data re-extraction unit 116 and the reference determination unit 120.
  • the data re-extraction unit (data selection unit) 116 re-extracts (selects) a predetermined number of data from the data evaluated for relevance.
  • the data re-extraction unit 116 supplies the extracted data to the data re-evaluation unit 118, the reference change unit 122, and / or the parameter determination unit 124.
  • the data re-evaluation unit 118 re-evaluates the relationship between the re-extracted data and the classification information.
  • the data re-evaluation unit 118 performs re-evaluation of relevance in the same manner as the data evaluation unit 114 described above.
  • the data re-evaluation unit 118 can also evaluate the relevance by calculating a score indicating the strength of the association between the data and the classification information (that is, a predetermined case). For example, the data re-evaluation unit 118 compares the keyword previously associated with the classification information with the text data included in the data, calculates the degree of coincidence, the degree of approximation, and the like, and evaluates the level of relevance according to the calculated result.
  • the data re-evaluation unit 118 can also evaluate the relationship for each phase, which is an index indicating each stage where a predetermined action progresses, in the same manner as the data evaluation unit 114.
  • the data re-evaluation unit 118 can also evaluate the relevance by using the data elements evaluated by the element evaluation unit 112. Further, the data re-evaluation unit 118 can also evaluate the relationship based on the correlation between the first data element and the second data element included in the data.
  • the data re-evaluation unit 118 can also evaluate the relevance between a sentence included in the text indicated by the text data in the data and the classification information, and evaluate the relevance between the data and the classification information based on that evaluation result. The data re-evaluation unit 118 supplies information indicating the re-evaluation result to the reference change unit 122 and the parameter determination unit 124.
  • the reference determination unit 120 determines whether the evaluation result by the data evaluation unit 114 satisfies a predetermined reference (or a predetermined threshold). For example, the reference determination unit 120 determines whether or not the score corresponding to the evaluation result corresponds to a predetermined function curve and, if the score does not correspond to the function curve, determines the degree of deviation between the curve and the score. The reference determination unit 120 can also determine, for each phase, whether or not a predetermined reference provided for that phase is satisfied. The reference determination unit 120 supplies information indicating the determination result to the reference change unit 122.
  • the reference change unit 122 changes the predetermined reference used in the reference determination unit 120 based on the re-evaluation result by the data re-evaluation unit 118. For example, when the score corresponding to the evaluation result does not correspond to a predetermined function curve, the reference change unit 122 changes the predetermined reference so that the reference determination unit 120 can determine that the score corresponds to the function curve.
  • the parameter determination unit 124 determines, based on the re-evaluation results, a parameter of a model assumed for the distribution of the re-evaluation results (for example, a fitting curve calculated by regression analysis).
  • the reference changing unit 122 changes a predetermined reference using the model identified by the parameter determined by the parameter determining unit 124.
  • the parameter determination unit 124 determines an exponent parameter of the function (for example, the value of ⁇ of the function described above).
  • the reference changing unit 122 can change, as the predetermined reference, a threshold value that is uniquely specified by rearranging the scores according to a predetermined rule (for example, ascending or descending order) and ranking them (for example, the higher the score, the higher the rank); for example, the threshold value corresponding to the rank most suitable for data extraction among a plurality of ranks. Furthermore, the reference changing unit 122 can also change the predetermined references provided for each phase.
  • the data element storage unit 126 associates the data element extracted by the data element extraction unit 110 with the result of evaluation of the data element by the element evaluation unit 112 and stores the data element in the database 100.
  • the generation unit generates, for each piece of partial data included in the data, appearance information indicating whether or not a predetermined data element appears in that partial data. More specifically, for example, the generation unit generates, for each sentence or paragraph included in the data (document), a keyword vector (appearance information) indicating whether or not a predetermined keyword (morpheme) is included in that sentence or paragraph.
  • the keyword vector is a vector in which each element takes a value of “0” or “1” to indicate whether or not the predetermined keyword associated with that element is included in the data. For example, when the keyword “price” is included in the second sentence or paragraph of the data, the generation unit changes the element corresponding to “price” in the keyword vector for that sentence or paragraph from “0” to “1”.
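As an illustration of the keyword vector described above, the following sketch builds one 0/1 vector per sentence. The keyword list, the naive sentence splitter, and the function names are assumptions introduced for this example only; the document does not specify how sentences or morphemes are segmented.

```python
import re

# Hypothetical list of predetermined keywords (assumption for illustration).
KEYWORDS = ["price", "adjustment", "schedule"]


def split_sentences(document):
    """Naively split on '.', '!' and '?'; real text would need a proper tokenizer."""
    return [part.strip() for part in re.split(r"[.!?]+", document) if part.strip()]


def keyword_vector(sentence, keywords=KEYWORDS):
    """Return a 0/1 vector: element i is 1 if keywords[i] appears in the sentence."""
    lowered = sentence.lower()
    return [1 if keyword in lowered else 0 for keyword in keywords]


def keyword_vectors(document, keywords=KEYWORDS):
    """Generate one keyword vector (appearance information) per sentence."""
    return [keyword_vector(sentence, keywords) for sentence in split_sentences(document)]
```

For instance, under these assumptions a two-sentence document whose second sentence mentions "price" and "adjustment" yields an all-zero vector for the first sentence and a vector with the first two elements set for the second.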
  • the relationship reflection unit reflects the relationship between the predetermined data element and another data element different from the predetermined data element in the appearance information generated by the generation unit, thereby generating the relationship information for each partial data.
  • the relationship reflection unit multiplies the keyword vector generated by the generation unit by a correlation matrix indicating the correlation between the predetermined keyword and another keyword different from the predetermined keyword, thereby obtaining the correlation vector 3 for each sentence as the relationship information. The correlation matrix is a square matrix whose elements each represent, when a keyword (for example, “price”) appears in a sentence, the likelihood (that is, the correlation) that another keyword (for example, “adjustment”) also appears in that sentence.
  • the relationship reflection unit outputs the correlation vector 3 to the calculation unit 14.
  • the correlation matrix is optimized in advance using a learning data set including a predetermined number of predetermined document data. For example, when the keyword “price” appears in a certain sentence, each element holds a value obtained by normalizing the number of occurrences of another keyword with respect to that keyword to between 0 and 1 (that is, a maximum likelihood estimate); therefore, the sum of each column of the correlation matrix is 1. Thereby, the data analysis system 1 can optimally calculate the correlation vector.
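One way to read the column-normalized correlation matrix described above is as sentence-level co-occurrence counts divided by their column totals. The sketch below follows that reading. Counting the keyword's own occurrences on the diagonal, leaving an all-zero column as zeros, and the function names are assumptions; the document does not state these details.

```python
def correlation_matrix(training_vectors):
    """Estimate a column-normalized co-occurrence matrix from 0/1 keyword vectors.

    Element [i][j] is the share of co-occurrences of keyword i among all keyword
    occurrences observed in sentences that contain keyword j.
    """
    size = len(training_vectors[0])
    counts = [[0.0] * size for _ in range(size)]
    for vector in training_vectors:
        for i in range(size):
            for j in range(size):
                if vector[i] and vector[j]:
                    counts[i][j] += 1.0
    for j in range(size):
        column_sum = sum(counts[i][j] for i in range(size))
        if column_sum > 0:  # keywords never seen in training keep a zero column
            for i in range(size):
                counts[i][j] /= column_sum
    return counts


def correlation_vector(matrix, keyword_vec):
    """Multiply the correlation matrix by a keyword vector (matrix times column vector)."""
    size = len(keyword_vec)
    return [sum(row[j] * keyword_vec[j] for j in range(size)) for row in matrix]
```

With this construction, a sentence that contains only "price" still receives a non-zero value for "adjustment" whenever the two keywords co-occurred in the training sentences.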
  • the data evaluation unit 114 and/or the data re-evaluation unit 118 can evaluate the relevance further based on the relationship information (correlation matrix) obtained by the relationship reflection unit. For example, as shown in [Equation 1] below, the data evaluation unit 114 and/or the data re-evaluation unit 118 calculates, for each piece of data, a score indicating the degree of association with the predetermined case based on the sum of all correlation vectors obtained by the relationship reflection unit. More specifically, instead of [Equation 1], the score can be calculated for each piece of data as shown in [Equation 2] below, by taking the inner product of the summed value (a vertical vector) and the weight vector W (a horizontal vector) indicating the weights of the predetermined keywords.
  • W represented by a horizontal vector
  • C represents a correlation matrix
  • s_s represents the s-th keyword vector 2.
  • TFnorm (the above summed value) is calculated as shown in the following [Equation 3].
  • TF_i represents the appearance frequency (Term Frequency) of the i-th keyword
  • s_js represents the j-th element of the s-th keyword vector 2.
  • w_i is the i-th element of the weight vector W.
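The equations themselves are not reproduced in this text, so the following is only a sketch of the verbal description: the score is taken as the inner product of the weight vector W with the sum, over all sentences, of the correlation matrix C applied to each keyword vector. The function name and the plain-list representation are assumptions, and the TFnorm normalization of [Equation 3] is deliberately not modeled because its form cannot be recovered from the text.

```python
def document_score(weights, matrix, keyword_vecs):
    """Score = W . (sum over sentences of C * s), following the verbal description.

    weights      -- weight vector W, one weight per keyword
    matrix       -- square correlation matrix C as a list of rows
    keyword_vecs -- one 0/1 keyword vector per sentence of the document
    """
    size = len(weights)
    summed = [0.0] * size  # running sum of the per-sentence correlation vectors
    for vec in keyword_vecs:
        for i in range(size):
            summed[i] += sum(matrix[i][j] * vec[j] for j in range(size))
    return sum(w * value for w, value in zip(weights, summed))
```

Because the correlation matrix spreads each keyword's contribution over related keywords, a document can score on a weighted keyword that never literally appears in it, which is the co-occurrence effect the description aims for.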
  • the emotion extraction unit extracts from the data the emotion of the user who generated the data, namely the emotion toward the event on which the evaluation is based.
  • the evaluation such as the author's style
  • the emotion extraction unit determines whether or not the keyword included in the text is stored in the database 100 as a data element.
  • the emotion extraction unit sets “+1.2” as the emotion extracted from the text.
  • the concept extraction unit refers to the database 100, which stores a predetermined pattern and the concept of that pattern in association with each other, and extracts from partial data (for example, sentences, paragraphs, etc.) including the predetermined pattern a superordinate concept that summarizes the partial data.
  • For example, when the sentence “sell accounting system to company A” is included in an email, the concept extraction unit checks whether the keywords “accounting system” and “sell” are registered in the database 100. When “system” is registered in the database 100 as a superordinate concept of “accounting system” and “introduction” is registered as a superordinate concept of “sell”, the concept extraction unit extracts “introduce system” as the superordinate concept of the sentence. As a result, the data analysis system 1 can accurately evaluate the relevance between the data and the predetermined case without being adversely affected by minor differences in keywords (that is, non-essential differences).
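The lookup in the example above can be sketched with a small in-memory table standing in for the database 100. The table contents mirror the example in the text; the dictionary representation, the substring matching, and the function name are assumptions for illustration.

```python
# Stand-in for the pattern-to-concept associations stored in database 100 (assumed layout).
SUPERORDINATE_CONCEPTS = {
    "accounting system": "system",
    "sell": "introduction",
}


def extract_concepts(sentence, concepts=SUPERORDINATE_CONCEPTS):
    """Return the superordinate concepts of registered patterns, in order of appearance."""
    lowered = sentence.lower()
    found = [
        (lowered.find(pattern), concept)
        for pattern, concept in concepts.items()
        if pattern in lowered
    ]
    return [concept for _, concept in sorted(found)]
```

Two sentences that differ only in the product name would map to the same concepts only if both product names are registered under the same superordinate concept, so the coverage of the table determines how much keyword variation is absorbed.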
  • FIG. 2 shows an example of the processing flow of the data analysis system according to the embodiment of the present invention.
  • the data extraction unit 102 extracts a data group from the data set stored in the database 100 (step 10; hereinafter, step is represented as “S”).
  • the data extraction unit 102 supplies the extracted data group to the data classification unit 108.
  • the classification information receiving unit 106 receives classification information from the user via the input device 104 (S15).
  • the classification information receiving unit 106 supplies the received classification information to the data classification unit 108.
  • the data classification unit 108 associates the classification information received by the classification information reception unit 106 with each of the predetermined number of data included in the data group (S20).
  • the data classification unit 108 supplies information indicating the associated result to the data evaluation unit 114.
  • the data evaluation unit 114 evaluates the relevance between the data of the data group included in the data set and the classification information (that is, the predetermined case) (S25, data evaluation step).
  • the data evaluation unit 114 supplies information indicating the evaluation result to the data re-extraction unit 116 and the reference determination unit 120.
  • the reference determination unit 120 analyzes information indicating the evaluation result and determines whether or not the evaluation result satisfies a predetermined reference (S30, reference determination step).
  • the data re-extraction unit 116 re-extracts a predetermined number of data from the evaluated data when receiving information indicating the evaluation result (S35, data selection step).
  • the data re-extraction unit 116 supplies the extracted data to the data re-evaluation unit 118.
  • the data re-evaluation unit 118 re-evaluates the relationship between the data received from the data re-extraction unit 116 and the classification information (S40, data re-evaluation step).
  • the data reevaluation unit 118 supplies information indicating the evaluation result to the reference changing unit 122.
  • the reference changing unit 122 analyzes information indicating the result of the re-evaluation by the data reevaluating unit 118 and changes a predetermined reference (S45, reference changing step). Then, the reference changing unit 122 causes the reference determining unit 120 to execute determination based on the changed predetermined reference.
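Steps S25 to S45 above can be sketched as a single pass. This is a simplified outline under stated assumptions: documents are plain strings, the scoring functions are supplied by the caller, re-extraction takes the highest-scoring documents, and the rule for changing the reference is a caller-supplied function. None of these choices is specified by the document, and all names are hypothetical.

```python
def run_pass(documents, score_fn, rescore_fn, threshold, sample_size, change_reference):
    """Run one evaluation, re-extraction, re-evaluation, and reference change pass."""
    # S25: evaluate the relevance of every document.
    scores = {doc: score_fn(doc) for doc in documents}
    # S30: determine whether each evaluation result satisfies the current reference.
    before = {doc: score >= threshold for doc, score in scores.items()}
    # S35: re-extract a predetermined number of documents (assumption: highest scores first).
    selected = sorted(documents, key=lambda doc: scores[doc], reverse=True)[:sample_size]
    # S40: re-evaluate the re-extracted documents.
    rescored = [rescore_fn(doc) for doc in selected]
    # S45: change the reference based on the re-evaluation results,
    # then repeat the determination with the changed reference.
    new_threshold = change_reference(threshold, rescored)
    after = {doc: score >= new_threshold for doc, score in scores.items()}
    return before, new_threshold, after
```

A caller would typically invoke this repeatedly, passing each returned threshold into the next pass, which corresponds to the repetition the description relies on to keep extraction accuracy from degrading.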
  • FIG. 3 shows an example of the hardware configuration of the data analysis system according to the embodiment of the present invention.
  • the data analysis system 1 includes a CPU 1500, a graphic controller 1520, a memory 1530 such as a random access memory (RAM), a read-only memory (ROM), and/or a flash ROM, a storage device 1540 for storing data, a reading/writing device 1545 for reading data from and/or writing data to a recording medium, an input device 1560 for inputting data, a communication interface 1550 for transmitting/receiving data to/from an external communication device, and a chipset 1510 that connects the CPU 1500, the graphic controller 1520, the memory 1530, the storage device 1540, the reading/writing device 1545, the input device 1560, and the communication interface 1550 so that they can communicate with each other.
  • the chipset 1510 interconnects the memory 1530, the CPU 1500 that accesses the memory 1530 and executes predetermined processing, and the graphic controller 1520 that controls display on an external display device, and passes data among these components.
  • the CPU 1500 operates based on a program stored in the memory 1530 and controls each component.
  • the graphic controller 1520 displays an image on a predetermined display device based on the image data temporarily stored on the buffer provided in the memory 1530.
  • the chipset 1510 connects a storage device 1540, a read / write device 1545, and a communication interface 1550.
  • the storage device 1540 stores programs and data used by the CPU 1500 of the data analysis system 1.
  • the storage device 1540 is, for example, a flash memory.
  • the read / write device 1545 reads the program and / or data from the storage medium storing the program and / or data, and stores the read program and / or data in the storage device 1540.
  • the reading / writing device 1545 acquires a predetermined program from a server on the Internet via the communication interface 1550, and stores the acquired program in the storage device 1540.
  • the communication interface 1550 executes data transmission / reception with an external device via a communication network. Further, when the communication network is disconnected, the communication interface 1550 can execute data transmission / reception with an external device without going through the communication network.
  • An input device 1560 such as a keyboard, a tablet, or a mouse is connected to the chip set 1510 via a predetermined interface.
  • the data analysis program for the data analysis system 1 stored in the storage device 1540 is provided to the storage device 1540 via a communication network such as the Internet or a recording medium such as a magnetic recording medium or an optical recording medium. Then, the program for the data analysis system 1 stored in the storage device 1540 is executed by the CPU 1500.
  • the data analysis program executed by the data analysis system 1 works on the CPU 1500 to cause the data analysis system 1 to function as the database 100, the data extraction unit 102, the input device 104, the classification information reception unit 106, the data classification unit 108, the data element extraction unit 110, the element evaluation unit 112, the data evaluation unit 114, the data re-extraction unit 116, the data re-evaluation unit 118, the reference determination unit 120, the reference change unit 122, the parameter determination unit 124, and the data element storage unit 126 described above.
  • the data analysis system 1 repeats the extraction of data, the evaluation of the relevance between the data and the classification information, and the determination as to whether or not the evaluation result satisfies a predetermined reference. Because this improves the extraction accuracy of relevant data, deterioration in learning accuracy can be prevented.
  • (Example) FIGS. 4, 6, and 8 show examples of fitting curves before the relevance between the data and the classification information is re-evaluated, and FIGS. 5, 7, 9, and 10 show examples of fitting curves after the re-evaluation. Note that the figure before data fitting corresponding to FIG. 10 is omitted. Further, the fluctuating graph indicated by the arrow is a fitting curve, and the other graph shows raw data.
  • the data analysis system 1 sets the three document scores as score thresholds for each stage.
  • the data evaluation unit 114 uses predictive coding to calculate three types of scores, corresponding to the three stages, for one document (that is, the database stores three types of keyword weights). For example, the weights of keywords such as “schedule” and “adjustment” are larger in the “relationship building phase” (phase 1) than in the “execution phase” (phase 3), and the weights of keywords such as “competitive product” and “investigation” are larger in the “preparation phase” (phase 2) than in the “relationship building phase” (phase 1). Different keywords may also be stored for each stage.
  • the reference determination unit 120 determines whether or not each of the three calculated scores exceeds its score threshold. (7) For example, when it determines that the score corresponding to phase 2 exceeds the score threshold, the reference determination unit 120 determines that the current situation is highly likely to be in phase 2 and alerts the system administrator. (8) The data re-extraction unit 116 or the system administrator refers to the calculated scores and adjusts the learning of the system. For example, when a document with a high score is reviewed and judged to be unrelated to the lawsuit, the tag “Non-Responsive” is added. Conversely, when a document with a low score is reviewed and judged to be related to the lawsuit, the tag “Responsive” is added. This information is fed back to the system.
  • the data analysis system 1 can increase or decrease the weight of a keyword included in the document, add an unregistered keyword and its weight to the database, or delete a keyword from the database.
  • the data re-evaluation unit 118 recalculates the score for each phase and refits the exponential function (FIGS. 5, 7, 9, and 10).
  • the reference changing unit 122 resets the score threshold by the same processing as the above (3) and (4).
  • the data analysis system 1 repeats the above (5) to (10) as necessary.
  • by the above flow, the data analysis system 1 can (a) prevent learning accuracy from deteriorating and (b) automatically increase keyword variations while maintaining determination accuracy. That is, the data analysis system 1 can automatically adapt to the operating environment (that is, automatic environment adaptation can be realized).
  • FIG. 4, FIG. 6, and FIG. 8 each show the result of analyzing a data set composed of a plurality of predetermined different data.
  • a fitting curve was calculated for each predetermined phase (hereinafter referred to as “stage” in the examples). That is, in the example, the fitting curves were calculated for each of the predetermined first phase (stage 1), second phase (stage 2), and third phase (stage 3).
  • the horizontal axis indicates the document score, and the vertical axis indicates the normalized rank on a logarithmic scale (that is, the rank when document scores are arranged in ascending order). Accordingly, the fitting curve using the exponential function is a straight line in the figure, and the rank is higher toward the bottom of the vertical axis and lower toward the top.
  • FIG. 5, FIG. 7, FIG. 9, and FIG. 10 each show the fitting curve after the exponent parameter of a predetermined function has been changed to the parameter determined by the parameter determination unit 124 following the re-evaluation processing in the data re-evaluation unit 118.
  • the coefficient of determination R² takes a value very close to 1, which shows that the fitting curve after the re-evaluation processing fits the experimental data accurately.
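Since an exponential function appears as a straight line when rank is plotted on a logarithmic axis against score, the exponent parameter can be estimated with an ordinary least-squares line fit on the logarithm of the rank. The sketch below shows that standard technique and the associated coefficient of determination, computed in log space. It is an assumption that the applicant's regression takes this form; the function name and the model rank = A * exp(b * score) are introduced here for illustration.

```python
import math


def fit_exponential(scores, ranks):
    """Fit rank = A * exp(b * score) by least squares on log(rank).

    Returns (A, b, r2), where r2 is the coefficient of determination in log space.
    Requires at least two distinct scores and ranks that are positive and not all equal.
    """
    logs = [math.log(rank) for rank in ranks]
    count = len(scores)
    mean_x = sum(scores) / count
    mean_y = sum(logs) / count
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(scores, logs))
    total = sum((y - mean_y) ** 2 for y in logs)
    r2 = 1.0 - residual / total
    return math.exp(intercept), slope, r2
```

An r2 close to 1 indicates that the log-rank values lie nearly on a straight line, which corresponds to the kind of agreement reported for the fitting curves after re-evaluation.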
  • the data analysis system 1 can also be applied to data other than documents. That is, the data analysis system 1 can also analyze data other than text. For example, when the data analysis system 1 analyzes speech, (1) the content of the conversation included in the speech may be converted into characters (text) by speech recognition and the text may be analyzed, or (2) the voice data may be analyzed as it is.
  • the data analysis system 1 converts speech into text by using an arbitrary speech recognition algorithm (for example, a recognition method using a hidden Markov model), and performs processing similar to that described above on the text. Thereby, the data analysis system 1 can analyze the voice.
  • the data analysis system 1 extracts partial voices (data elements) included in the voice. For example, when the voice “adjust price” is obtained, the data analysis system 1 extracts the partial voices “price” and “adjustment” from the voice and, based on the evaluation results of the partial voices, can evaluate the relevance between the unclassified voice (unknown data) and the classification information (predetermined case). In this case, the data analysis system 1 can separate voices using a time series data classification algorithm (for example, a hidden Markov model, a Kalman filter, a neural network, etc.). Thereby, the data analysis system 1 can analyze the voice.
  • the data analysis system 1 can also analyze video (moving images).
  • the data analysis system 1 can identify a person included in the frame image by extracting a frame image included in the video and using an arbitrary face recognition technique.
  • by using an arbitrary motion recognition technique (for example, a pattern matching technique may be applied), the data analysis system 1 can extract the motion of the person from a partial video included in the video (that is, a video consisting of some of the frame images included in the video).
  • the data analysis system 1 can evaluate the relevance between the unclassified video (unknown data) and the classification information (predetermined case) based on the person and / or motion. Thereby, the data analysis system 1 can analyze the video.
  • the data analysis system 1 can be applied not only to a forensic system (a system for extracting lawsuit related documents) but also to the following.
  • the present invention can be applied to medical application systems (systems that predict risks such as diseases using electronic medical records and nursing records as data).
  • the medical application system evaluates the relevance between data (for example, electronic medical records, nursing records, etc.) and a predetermined case (for example, that the drug has been effective for the patient, that the patient has been ready after the diagnosis by the doctor, etc.), whereby, for example, the effect of the drug can be objectively evaluated, or a diagnosis by a skilled doctor can be applied to other patients.
  • the medical application system can evaluate the relevance for each phase, such as a follow-up phase (phase 1), a remission phase (phase 2), a complete cure phase (phase 3), and a recurrence phase (phase 4).
  • the data analysis system 1 can also be applied to an email audit system.
  • the e-mail auditing system evaluates the relevance between data (for example, e-mail distributed daily on the network) and a predetermined case (for example, the e-mail is leaking confidential information of the organization, the e-mail is sent to another organization).
  • the data analysis system 1 can also be applied to an Internet application system.
  • the Internet application system evaluates the relevance between data (for example, a message posted by the user to the SNS, recommended information posted on a website, the profile of a user or group, etc.) and a predetermined case (for example, the user's preference is similar to that of other users, the user's preference matches the attributes of a restaurant, etc.), whereby, for example, it can display a list of other users, present restaurant information that suits the user's preferences, and warn of organizations that may harm the user. Thereby, the Internet application system (data analysis system 1) can improve the convenience of the Internet.
  • the data analysis system 1 can also be applied to a driving support system.
  • the driving support system evaluates the relevance between data (for example, data acquired from an in-vehicle sensor, a camera, a microphone, and the like) and a predetermined case (for example, information that a skilled driver paid attention to while driving), whereby, for example, useful information that can make driving safer and more comfortable can be automatically extracted.
  • the data analysis system 1 can be applied to financial related systems.
  • the financial system evaluates the relevance of the data (for example, notification documents to the bank, the market price of the stock price, etc.) and a predetermined case (for example, there is a risk of fraud or a rise in the stock price). By doing so, for example, it is possible to detect a report having an unauthorized purpose or to predict a future stock price.
  • the data analysis system 1 can be applied to a performance evaluation system.
  • the performance evaluation system evaluates the relevance between data (for example, daily reports submitted by the sales staff to the company, analysis materials submitted by the consultant to the customer) and a predetermined case (for example, that the sales staff improves sales performance, or a case concerning the consultant and the customer), whereby, for example, it is possible to perform personnel evaluations of sales staff and consultants and to evaluate the success or failure of a project.
  • the data analysis system of the present invention can be applied to any system that achieves its purpose by evaluating the relevance between data and a predetermined case, such as a forensic system, a discovery support system, a medical application system, an Internet application system, a driving support system, a financial related system, and a performance evaluation system.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided is a data analysis system which, by analyzing whether a pattern which is extracted from training data is included in unknown data, evaluates a correlation between the unknown data and a prescribed case, said data analysis system comprising: a data evaluating unit which evaluates a correlation on the basis of the extracted pattern; a reference assessment unit which assesses whether the result of the evaluation satisfies a prescribed reference; a data selection unit which selects a prescribed number of instances of data from the unknown data for which the correlation has been evaluated; a data re-evaluating unit which re-evaluates the correlation between the selected data and the prescribed case on the basis of a pattern which is included in the selected data; and a reference change unit which changes the prescribed reference on the basis of the result of the re-evaluation.

Description

Data analysis system, data analysis system control method, and data analysis system control program
 The present invention relates to a data analysis system, a data analysis system control method, and a data analysis system control program.
 Conventionally, a document classification system is known that includes: an extraction unit that extracts a predetermined number of documents from collected document information; a document display unit that displays the extracted document group on a screen; a classification code receiving unit that receives classification codes that a user has assigned to the displayed document group based on relevance to a lawsuit; a selection unit that classifies the extracted document group by classification code and analyzes and selects keywords that appear in common in the classified document group; a database that records the selected keywords; a search unit that searches the document information for the keywords recorded in the database; a score calculation unit that calculates a score indicating the relevance between a classification code and a document using the search results of the search unit and the analysis results of the selection unit; and an automatic classification unit that automatically assigns classification codes based on the score (see, for example, Patent Document 1). According to the document classification system described in Patent Document 1, digitized document information collected for submission as evidence in a lawsuit can be analyzed and classified so that it is easy to use in the lawsuit.
JP 2013-182338 A (Patent Document 1)
 In a document classification system such as the one described in Patent Document 1, documents may be sampled at random from the document information to be classified, and a reviewer may manually extract the group of documents to which classification codes are to be assigned. With manual extraction, the accuracy of data extraction cannot be improved automatically. It is therefore desirable to classify digital information automatically and, by reducing the degradation of automatic processing accuracy in the classification process, to prevent degradation of learning accuracy.
 Accordingly, an object of the present invention is to provide a data analysis system, a data analysis method, and a data analysis program that can reduce degradation of the automatic processing accuracy of the classification process.
 To achieve the above object, a data analysis system according to one aspect of the present invention evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. The system comprises: a data evaluation unit that evaluates the relevance based on the extracted pattern; a criterion determination unit that determines whether the evaluation result of the data evaluation unit satisfies a predetermined criterion; a data selection unit that selects a predetermined number of data items from the unknown data whose relevance has been evaluated by the data evaluation unit; a data re-evaluation unit that re-evaluates the relevance between the data selected by the data selection unit and the predetermined case based on patterns included in the selected data; and a criterion changing unit that changes the predetermined criterion used by the criterion determination unit based on the re-evaluation result of the data re-evaluation unit.
 The data analysis system can be implemented in at least the following three configurations: (a) a configuration in which part or all of a data analysis program that realizes the data analysis system is executed on a client device (for example, a user terminal such as a personal computer or a smartphone); (b) a configuration in which part or all of the data analysis program is executed on a server device (for example, a mainframe, a cluster computer, or any computer capable of providing the system's services to external equipment) and the result of the execution is returned to the client device; or (c) a configuration in which the processing included in the data analysis program is divided arbitrarily between the client device and the server device. In other words, it suffices that the data analysis system is realized as a system composed of at least one computer, and each function of the data analysis system can be realized by dividing it arbitrarily among the computers that constitute the system.
 The data analysis system according to one aspect of the present invention may further comprise a parameter determination unit that determines, based on the re-evaluation result, the parameters of a model assumed for the distribution of the re-evaluation results, and the criterion changing unit may, for example, change the predetermined criterion using the model identified by the determined parameters.
 In the data analysis system according to one aspect of the present invention, the model may be, for example, a function belonging to the exponential family of distributions, and the parameter determination unit may determine an exponent parameter of the function.
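As a rough illustration of determining an exponent parameter from re-evaluation results, the sketch below fits a curve of the form y = a * exp(b * x) by least squares on log(y). The function name `fit_exponential`, the choice of a plain exponential curve, and the interpretation of x and y (for instance, x as a score and y as a positive quantity observed when the selected data is re-evaluated) are assumptions made for illustration; the text does not specify the fitting procedure.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by linear least squares on log(y).

    Hypothetical helper: returns (a, b), where b plays the role of the
    exponent parameter. All y values must be strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxl / sxx                       # slope of log(y) against x
    a = math.exp(mean_log - b * mean_x)  # intercept mapped back from log space
    return a, b
```

Fitting in log space keeps the example dependency-free; a nonlinear least-squares fit on the raw values would weight the points differently and may be preferable when the observations are noisy.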
 In the data analysis system according to one aspect of the present invention, the data re-evaluation unit may, for example, evaluate the relationship by calculating a score indicating the strength of the connection between the selected data and the predetermined case; the predetermined criterion may be based on whether a threshold set for the score has been exceeded; and the criterion changing unit may change the predetermined criterion by changing the threshold to another threshold that can be uniquely identified from a rank determined by sorting the scores according to a predetermined rule.
 The data analysis system according to one aspect of the present invention may further comprise: a generation unit that generates, for each piece of partial data included in the data, appearance information indicating whether a predetermined data element appears in that partial data; and a relationship reflecting unit that obtains relationship information for each piece of partial data by reflecting, in the appearance information generated by the generation unit, the relationship between the predetermined data element and another data element different from it. The data evaluation unit and/or the data re-evaluation unit may, for example, evaluate the degree of relevance further based on the relationship information obtained by the relationship reflecting unit.
 In the data analysis system according to one aspect of the present invention, the data may, for example, include at least a user's evaluation of an event, and the system may further comprise an emotion extraction unit that extracts from the data the emotion of the user who generated the data, namely the emotion toward the event that arose from the evaluation. The data evaluation unit and/or the data re-evaluation unit may evaluate the degree of relevance further based on the extraction result of the emotion extraction unit.
 The data analysis system according to one aspect of the present invention may further comprise a concept extraction unit that, by referring to a database storing a predetermined pattern in association with the concept that the pattern carries, extracts from partial data containing the predetermined pattern a superordinate concept capable of summarizing that partial data.
 Here, the database need only be a storage medium connected readably and writably to the computer that realizes the data analysis system. It broadly includes, for example, a hard disk connected to the computer, a file server communicably connected to the computer via a predetermined network (for example, the Internet or an intranet), and memory provided in another computer communicably connected to the computer via the predetermined network.
 The data analysis system according to one aspect of the present invention may further comprise: a data extraction unit that extracts a predetermined number of data items as training data from a data set stored in a database; a classification information receiving unit that receives, from a user via a predetermined input device, classification information indicating the classification of the training data; and a data classification unit that classifies the training data by associating the received classification information with the training data. The data analysis system may, for example, extract patterns from the training data based on the result of classifying the training data.
 Here, the data analysis system may have (a) a configuration in which a client device communicably connected via a predetermined network to the computer that realizes the data analysis system is equipped with a predetermined input device (for example, a keyboard or a mouse), the client device transmits the classification information entered via the input device to the computer, and the classification information receiving unit realized on the computer acquires the classification information; or (b) a configuration in which the computer itself is equipped with the predetermined input device and the classification information receiving unit acquires the classification information.
 In the data analysis system according to one aspect of the present invention, the data evaluation unit and/or the data re-evaluation unit may, for example, evaluate the relationship for each phase, a phase being an index indicating each stage through which a predetermined act progresses; the criterion determination unit may determine, for each phase, whether the predetermined criterion set for that phase is satisfied; and the criterion changing unit may change each of the predetermined criteria set for the phases.
 The data analysis system according to one aspect of the present invention may further comprise a data element extraction unit that extracts data elements from the data classified by the data classification unit, and an element evaluation unit that evaluates the data elements according to predetermined element criteria. The data evaluation unit and/or the data re-evaluation unit may, for example, evaluate the relevance using the data elements evaluated by the element evaluation unit.
 In the data analysis system according to one aspect of the present invention, the element evaluation unit may, for example, evaluate a data element by using, as one of the predetermined element criteria, the amount of transmitted information that represents the dependency between the data element and the classification information associated with the data containing that data element.
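The following is a minimal sketch of one way such an element criterion could be computed, assuming that the amount of transmitted information corresponds to the mutual information between "the data element appears in the data" and "the classification assigned to the data". The function name `mutual_information` and the representation of each data item as a set of tokens are hypothetical choices for illustration only.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information in bits between presence of `term` and the label.

    `docs` is a list of token sets and `labels` the matching list of
    classification values (assumed representation, not from the source).
    """
    n = len(docs)
    if n == 0 or n != len(labels):
        raise ValueError("docs and labels must be non-empty and equal length")
    # Joint counts of (term present?, label).
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter()
    label_counts = Counter()
    for (present, label), count in joint.items():
        term_counts[present] += count
        label_counts[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_indep = (term_counts[present] / n) * (label_counts[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi
```

A data element whose presence tracks the classification closely receives a high value, while an element that appears equally often under every classification receives a value near zero, so the value can serve as a weight for that element.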
 The data analysis system according to one aspect of the present invention may further comprise a data element storage unit that stores in the database each data element extracted by the data element extraction unit in association with the result of the element evaluation unit's evaluation of that data element.
 In the data analysis system according to one aspect of the present invention, the data evaluation unit and/or the data re-evaluation unit may, for example, evaluate the relationship based on the correlation between a first data element and a second data element included in the data.
 In the data analysis system according to one aspect of the present invention, the data may, for example, include text in at least part, and the data evaluation unit and/or the data re-evaluation unit may evaluate the relevance between a sentence or paragraph included in the text and the predetermined case and, based on that evaluation result, evaluate the relevance between the data and the predetermined case.
 To achieve the above object, a data analysis method according to one aspect of the present invention is a control method for a data analysis system that evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. The method includes: a data evaluation step of evaluating the relevance based on the extracted pattern; a criterion determination step of determining whether the evaluation result of the data evaluation step satisfies a predetermined criterion; a data selection step of selecting a predetermined number of data items from the unknown data whose relevance was evaluated in the data evaluation step; a data re-evaluation step of re-evaluating the relevance between the data selected in the data selection step and the predetermined case based on patterns included in the selected data; and a criterion changing step of changing the predetermined criterion based on the re-evaluation result of the data re-evaluation step.
 To achieve the above object, a data analysis program according to one aspect of the present invention is a control program for a data analysis system that evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. The program causes a computer to realize: a data evaluation function of evaluating the relevance based on the extracted pattern; a criterion determination function of determining whether the evaluation result of the data evaluation function satisfies a predetermined criterion; a data selection function of selecting a predetermined number of data items from the unknown data whose relevance has been evaluated by the data evaluation function; a data re-evaluation function of re-evaluating the relevance between the data selected by the data selection function and the predetermined case based on patterns included in the selected data; and a criterion changing function of changing the predetermined criterion based on the re-evaluation result of the data re-evaluation function.
 The present invention can provide a data analysis system, a data analysis method, and a data analysis program that can reduce degradation of the automatic processing accuracy of the classification process.
A functional block diagram of the data analysis system according to the embodiment.
A flowchart of the processing of the data analysis system according to the embodiment.
A hardware configuration diagram of the data analysis system according to the embodiment.
A graph showing a fitting curve before it is fitted to the selected data.
A graph showing a fitting curve fitted to the selected data by regression analysis.
A graph showing a fitting curve before it is fitted to the selected data.
A graph showing a fitting curve fitted to the selected data by regression analysis.
A graph showing a fitting curve before it is fitted to the selected data.
A graph showing a fitting curve fitted to the selected data by regression analysis.
A graph showing a fitting curve fitted to the selected data by regression analysis.
[Embodiment]
(Overview of data analysis system 1)
 The data analysis system 1 according to the present embodiment evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data. More specifically, the data analysis system 1 evaluates whether data stored in an information processing device such as a user terminal or a server (for example, data relating to e-mails or documents) is related to the predetermined case, according to whether the result of evaluating the relevance between the predetermined case and each data item satisfies a predetermined criterion.
 Here, the predetermined case is, for example, information indicating a relation to a lawsuit, a cartel, fraud, and/or a stage in the progress of any of these. The training data and/or unknown data (hereinafter sometimes collectively called "data") may be, for example, text data (data that at least partly contains text, such as e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, and business plans), and the pattern may be a keyword (for example, a morpheme) included in the text data; it may also be, for example, a sentence or a paragraph. As one example, the data analysis system 1 according to the present embodiment can be applied to a forensic system, that is, technology that, when a computer-related crime or legal dispute such as unauthorized access or leakage of confidential information occurs, collects and analyzes the data that forms the electronic record needed to investigate the cause of the crime or dispute and clarifies its legal evidentiary value (or to a discovery support system that supports discovery procedures in US civil litigation).
 Specifically, the data analysis system 1 first extracts, as training data, a data group containing a predetermined number of data items from a database that stores e-mails and the like. The data analysis system 1 associates, with each of all or some of the data items in the extracted data group, classification information indicating a predetermined classification of the data (whether or not it is related to the predetermined case). Next, the data analysis system 1 evaluates the relevance between data (mainly unknown data not yet associated with classification information) and the classification information (that is, the relevance between the unknown data and the predetermined case). For example, the data analysis system 1 scores each data item using keywords that have been associated in advance with weighting information corresponding to their degree of relevance to the classification information. If the data analysis system 1 is realized as a "forensic system", for instance, the degree of relevance is "the degree to which a document (data) relates to the lawsuit in question". As one example, the higher the relevance between the classification information (the predetermined case) and the data, the higher the score the data analysis system 1 associates with the data.
 The data analysis system 1 may be provided in advance with a keyword database that stores a plurality of keywords related to the classification information and the weighting information associated with each of the keywords. The weighting information assigns a larger weight the more relevant the associated keyword is to the classification information. The data analysis system 1 uses this weighting information to score each of the plurality of data items.
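The keyword-weighted scoring described here can be sketched as follows. This is a minimal illustration assuming the score is the sum of the weights of the distinct keywords that occur in a data item; the function name `score_document`, the dictionary representation of the keyword database, and the summation rule are assumptions, since the text does not state how the weights are combined.

```python
def score_document(tokens, keyword_weights):
    """Sum the weights of the distinct keywords that occur in `tokens`.

    `keyword_weights` stands in for the keyword database: a mapping from
    keyword to weight (hypothetical representation). Repeated occurrences
    of a keyword are counted once in this sketch.
    """
    present = set(tokens)
    return sum(weight for keyword, weight in keyword_weights.items()
               if keyword in present)
```

Counting each keyword once keeps long documents from scoring high purely through repetition; a frequency-sensitive variant would multiply each weight by the number of occurrences instead.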
 The data analysis system 1 then determines whether the relevance evaluation result satisfies a predetermined criterion (for example, whether a value calculated by an exponential function whose variable is the score associated with the data, hereinafter sometimes called the "rank", exceeds a predetermined rank) and, if the criterion is satisfied, judges that the extracted data is related to the data classification indicated by the classification information. Further, the data analysis system 1 again extracts a predetermined number of data items from the data whose relevance has been evaluated and re-evaluates the relevance between the re-extracted data and the classification information. The data analysis system 1 changes the predetermined criterion according to the re-evaluation result. By repeating the extraction of data, the evaluation of the relevance between the data and the classification information, and the determination of whether the evaluation result satisfies the predetermined criterion, the data analysis system 1 can improve the accuracy with which it extracts data relevant to the classification information.
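One cycle of this feedback loop can be sketched as below. The names `choose_threshold` and `review_cycle`, the use of a plain score cutoff as the criterion, and the rule of picking the cutoff that misclassifies the fewest re-evaluated items are all assumptions made for illustration; the re-evaluation itself is represented by a caller-supplied function because the text leaves its details to later sections.

```python
def choose_threshold(reviewed):
    """Return the score cutoff with the fewest errors on the reviewed sample.

    `reviewed` is a non-empty list of (score, is_relevant) pairs. An item
    counts as flagged when its score is strictly greater than the cutoff.
    Only observed scores are tried as cutoffs (illustrative simplification).
    """
    candidates = sorted({score for score, _ in reviewed})

    def errors(cutoff):
        return sum((score > cutoff) != is_relevant
                   for score, is_relevant in reviewed)

    return min(candidates, key=errors)


def review_cycle(scores, sample_ids, reevaluate):
    """Re-evaluate a sample, update the cutoff, and re-flag all data.

    `scores` maps a data id to its score; `reevaluate` maps a data id to
    True or False (hypothetical stand-in for the re-evaluation step).
    """
    reviewed = [(scores[i], reevaluate(i)) for i in sample_ids]
    threshold = choose_threshold(reviewed)
    flagged = {i for i, s in scores.items() if s > threshold}
    return threshold, flagged
```

Repeating `review_cycle` with a fresh sample each time mirrors the repeated extraction, evaluation, and determination described above, with the cutoff tracking the most recent re-evaluation.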
 In the present embodiment, the server is one or more servers and may be configured to include a plurality of servers. For example, the server includes a server capable of storing digital information, such as a mail server, a file server, or a document management server. Likewise, the user terminal is one or more user terminals and may be configured to include a plurality of user terminals. For example, the user terminal includes a personal computer, a notebook computer, a tablet PC, or a mobile communication terminal such as a mobile phone.
(Details of data analysis system 1)
 FIG. 1 shows an example of the functional block configuration of the data analysis system according to the present embodiment.
 The data analysis system 1 comprises: a database 100 that stores data; a data extraction unit 102 that extracts predetermined data from the database 100; an input device 104 for entering information such as classification information; a classification information receiving unit 106 that receives the classification information; a data classification unit 108 that associates classification information with data; a data element extraction unit 110 that extracts data elements from data; an element evaluation unit 112 that evaluates the data elements; a data evaluation unit 114 that evaluates the relevance between data and classification information; a data re-extraction unit 116 that re-extracts a predetermined number of data items from the data whose relevance has been evaluated; a data re-evaluation unit 118 that re-evaluates the relevance between the re-extracted data and the classification information; a criterion determination unit 120 that determines whether an evaluation result satisfies a predetermined criterion; a criterion changing unit 122 that changes the predetermined criterion based on the re-evaluation result; and a parameter determination unit 124 that determines the parameters of a model assumed for the distribution of the re-evaluation results.
 The data analysis system 1 may further comprise: a generation unit that generates, for each piece of partial data included in the data, appearance information indicating whether a predetermined data element appears in that partial data; a relationship reflecting unit that obtains relationship information for each piece of partial data by reflecting, in the appearance information generated by the generation unit, the relationship between the predetermined data element and another data element different from it; an emotion extraction unit that extracts from the data the emotion of the user who generated the data, namely the emotion toward an event that arose from the user's evaluation; and a concept extraction unit that, by referring to a database storing a predetermined pattern in association with the concept that the pattern carries, extracts from partial data containing the predetermined pattern a superordinate concept capable of summarizing that partial data (none of these units is shown in FIG. 1).
(Database 100)
 The database 100 stores a data set having data groups that each contain a predetermined number of data items. The database 100 also stores data elements included in the data, each associated with the evaluation result from the element evaluation unit 112. In the present embodiment, the data includes, for example, text data, image data, and/or audio data. When the data is text data, the text data may include data representing sentences, a plurality of morphemes, or a plurality of co-occurrence expressions. Examples of text data are e-mails, presentation materials, spreadsheets, meeting materials, contracts, organization charts, and business plans. A data group is, for example, a set of data that contains a plurality or a predetermined number of data items as a coherent unit. A data set is composed of a plurality of data groups. A data element is, for example, a word, where a word is the smallest linguistic unit that has a specific meaning and function in grammar.
 The score calculated for data (that is, the evaluation result; when the data is a document file, for example, the score corresponds to a "document score") is, for example, a numerical value indicating how relevant the data is to the predetermined case. A larger value indicates higher relevance. For example, the data analysis system 1 includes a combination storage unit that serves as a dictionary storing combinations of multiple words related to predetermined classification information, each associated with a score indicating its level of relevance to that classification information. When a given file is selected, the data analysis system 1 analyzes the text in the file by morphological analysis and determines whether a word combination stored in the combination storage unit is contained in the selected file. If it determines that a stored word combination is contained in the selected file, the data analysis system 1 judges the level of relevance of the file to the predetermined classification information based on the distance between the words, the order of the words, and/or whether the words occur in the same sentence. The data analysis system 1 then associates information indicating the judgment result (that is, information indicating the level of relevance to the predetermined classification information) with the selected file.
Note that relevance is, as one example, the closeness of the relationship between data and classification information. The closer the data is to the classification information, the higher the relevance; the less close, the lower the relevance. For example, when a plurality of keywords belonging to predetermined classification information are set in advance and the words contained in the data are compared with those keywords, the closeness increases with the number of matching keywords.
The database 100 supplies stored data to the data extraction unit 102 in response to a request from the data extraction unit 102. The database 100 may also be provided outside the data analysis system 1. For example, the database 100 and the other components of the data analysis system 1 may be connected so that they can communicate with each other via a communication network such as the Internet, or via a wired or wireless network such as a LAN.
(Data extraction unit 102)
The data extraction unit 102 extracts a data group from the data set stored in the database 100 as the target of classification by the user (training data). The data extraction unit 102 supplies the extracted data group to the data classification unit 108.
(Input device 104, classification information receiving unit 106)
The input device 104 receives classification information, as predetermined information, from the user. The classification information receiving unit 106 receives the classification information, which indicates how data is classified, from the user via the input device 104. The input device 104 is, for example, a keyboard, a mouse, and/or a touch panel. Classification information is information that identifies a classification target, such as a lawsuit, a cartel, and/or a predetermined stage in the progress of either. As one example, each piece of classification information is associated with an identifier that uniquely identifies the classification target. As one example, classification information is a tag such as “Responsive”, indicating that the document is related to the lawsuit in question, “Hot”, indicating that it is particularly closely related, or “Non-Responsive”, indicating that it is unrelated. As another example, classification information is a tag such as “Responsive”, indicating that the e-mail shows a user attempting to leak confidential information of the organization, or “Non-Responsive”, indicating that the e-mail is unrelated to such a leak.
(Data classification unit 108)
The data classification unit 108 classifies the predetermined number of data items in the data group by associating them with the classification information that the classification information receiving unit 106 received, either for the predetermined number of data items as a whole or for each data item individually. The data classification unit 108 supplies the data associated with the classification information to the data element extraction unit 110.
(Data element extraction unit 110)
The data element extraction unit 110 extracts data elements (for example, morphemes or words) from the data classified by the data classification unit 108. The data element extraction unit 110 supplies the extracted data elements to the element evaluation unit 112.
(Element evaluation unit 112)
The element evaluation unit 112 evaluates the data elements extracted by the data element extraction unit 110 according to predetermined element criteria. Specifically, the element evaluation unit 112 evaluates a data element according to a measure of the mutual dependence between the data element and the classification information associated with the data containing that data element. For example, the element evaluation unit 112 calculates the transmitted information amount (mutual information), which expresses the dependence between the data element and the classification information associated with the data containing it, and evaluates the data element using the calculated value as one of the predetermined element criteria. (The transmitted information amount is calculated from a predetermined defining formula using a random variable representing the occurrence of a given word and a random variable representing the occurrence of given classification information.) As one example, the larger the calculated transmitted information amount, the more strongly the element evaluation unit 112 rates the data element as a word that characterizes the predetermined classification information.
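The description leaves the “predetermined defining formula” unstated. As a minimal sketch, assuming the standard definition of mutual information between two binary variables (presence of a word, and membership in one classification tag), the evaluation of a data element could look like the following. The function name, the sample documents, and the tags are illustrative assumptions, not part of the described system.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word, target_label):
    """Mutual information (in bits) between 'word is present in the document'
    and 'document carries target_label'. Assumed standard definition."""
    n = len(docs)
    joint = Counter()
    for tokens, label in zip(docs, labels):
        joint[(word in tokens, label == target_label)] += 1
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Marginal probabilities obtained by summing the joint counts.
        p_x = sum(c for (a, _), c in joint.items() if a == x) / n
        p_y = sum(c for (_, b), c in joint.items() if b == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# Hypothetical training data: token sets with user-assigned tags.
docs = [{"price", "adjust"}, {"price", "meeting"}, {"lunch"}, {"meeting", "lunch"}]
labels = ["Responsive", "Responsive", "Non-Responsive", "Non-Responsive"]

mi_price = mutual_information(docs, labels, "price", "Responsive")
mi_meeting = mutual_information(docs, labels, "meeting", "Responsive")
```

In this toy data, “price” appears only in the Responsive documents and so receives a higher value than “meeting”, which is spread evenly across both tags.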
(Data evaluation unit 114)
The data evaluation unit 114 evaluates the relevance between data in the data set (unknown data) and the classification information, based on the classification result from the data classification unit 108. For example, the data evaluation unit 114 determines the level of relevance of the data to predetermined classification information based on the distance between the data and the data or keywords associated in advance with the classification information, the word order, the number or frequency of occurrences of the data associated in advance with the classification information, and/or whether a plurality of such data appear in the same sentence. The data evaluation unit 114 then evaluates the data based on information indicating the result (that is, information indicating the level of relevance to the predetermined classification information).
Specifically, the data evaluation unit 114 generates appearance information indicating whether a predetermined data element appears in partial data contained in the data. More specifically, the data evaluation unit 114 generates, for example, a keyword vector (appearance information) indicating whether a predetermined keyword (morpheme) is contained in a sentence or paragraph of the data (document). Each element of the keyword vector takes the value “0” or “1” to indicate whether the keyword associated with that element is contained in the data (a so-called “bag of words”). For example, when the data contains the keyword “price”, the data evaluation unit 114 changes the element of the keyword vector corresponding to “price” from “0” to “1”. The data evaluation unit 114 then calculates the score S of each data item as the inner product of the generated keyword vector and the evaluation values (weights) of the keywords (see [Equation 1] below).

[Equation 1]
$$S = w^{T} \cdot s$$

Here, s denotes the keyword vector and w denotes the weight vector. T denotes transposition.
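The score calculation described above can be sketched as follows: a 0/1 keyword vector s is built for a document and its inner product with a weight vector w is taken. The vocabulary, the weights, and the function names are hypothetical values chosen for illustration only.

```python
def keyword_vector(tokens, vocabulary):
    """0/1 vector: element k is 1 if vocabulary[k] occurs in the tokens."""
    present = set(tokens)
    return [1 if kw in present else 0 for kw in vocabulary]

def score(s, w):
    """Inner product of the keyword vector s and the weight vector w."""
    return sum(w_i * s_i for w_i, s_i in zip(w, s))

# Hypothetical vocabulary and evaluation values (weights).
vocabulary = ["price", "adjust", "meeting"]
weights = [1.5, 2.0, 0.25]

s = keyword_vector(["we", "adjust", "the", "price"], vocabulary)
S = score(s, weights)
```

Here the document contains “price” and “adjust” but not “meeting”, so only the first two weights contribute to S.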
The data evaluation unit 114 may also evaluate relevance for each phase (for example, a relationship building phase, a preparation phase, and/or an execution phase). A phase is an index indicating a stage in the progress of a predetermined act (for example, a criminal act); that is, it classifies the act according to how far it has progressed. For example, the “Relationship Building” phase is the stage that precedes the “Competition” phase, in which relationships with customers and competitors are built. The “Preparation” phase is the stage in which information about competition is exchanged with competitors (or with third parties). The “Competition” phase is the stage in which a price is presented to a customer, feedback is obtained, and the feedback is discussed with competitors. In the “Relationship Building” phase, the act “inquiry from a customer” (a predetermined act that can give rise to a lawsuit or fraud investigation) commonly occurs. In the “Preparation” phase, the act “obtaining a competitor's production status” (likewise a predetermined act that can give rise to a lawsuit or fraud investigation) often occurs. Other typical acts that can give rise to a lawsuit or fraud investigation are likewise known and can be associated with each of the phases. For example, the data evaluation unit 114 uses keywords that have been assigned a weight for each phase in advance, and evaluates the relevance between the data and the classification information phase by phase. The data evaluation unit 114 calculates a score for the data under evaluation for each phase by summing the weights of the keywords belonging to the phase and associating the result with the phase. It then compares each score with a predetermined threshold that indicates whether the phase has occurred, and thereby evaluates whether each phase has occurred.
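As a rough sketch of the per-phase evaluation, the snippet below sums the weights of the phase keywords found in a document and compares each sum with a per-phase threshold. The keyword lists, weights, and thresholds are invented for illustration; the description does not give concrete values.

```python
# Hypothetical per-phase keyword weights and thresholds.
PHASE_KEYWORDS = {
    "Relationship Building": {"inquiry": 1.0, "introduce": 0.5},
    "Preparation": {"production": 1.0, "competitor": 0.8},
    "Competition": {"price": 1.2, "feedback": 0.7},
}
PHASE_THRESHOLDS = {
    "Relationship Building": 1.0,
    "Preparation": 1.5,
    "Competition": 1.5,
}

def phase_scores(tokens):
    """Sum the weights of each phase's keywords that occur in the tokens."""
    present = set(tokens)
    return {
        phase: sum(w for kw, w in keywords.items() if kw in present)
        for phase, keywords in PHASE_KEYWORDS.items()
    }

def detected_phases(tokens):
    """Phases whose score reaches the phase's threshold."""
    scores = phase_scores(tokens)
    return [p for p, s in scores.items() if s >= PHASE_THRESHOLDS[p]]

tokens = ["competitor", "production", "status", "price"]
scores = phase_scores(tokens)
phases = detected_phases(tokens)
```

With these assumed values, only the Preparation phase reaches its threshold for the sample tokens.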
The data evaluation unit 114 can also evaluate relevance using the data elements evaluated by the element evaluation unit 112. Furthermore, the data evaluation unit 114 can evaluate relevance based on the correlation between a first data element and a second data element contained in the data; that is, the evaluation is a co-occurrence process, one that takes into account the correlation between one word and another. For example, the data evaluation unit 114 extracts, from the data elements (for example, morphemes or words) of the text data contained in the data, a first data element and a second data element different from the first. Note that the data evaluation unit 114 and the data re-evaluation unit 118 have substantially the same function, and both calculate scores; specifically, the data evaluation unit 114 also evaluates relevance by calculating a score. The data evaluation unit 114 then evaluates the level of relevance according to the positions of the first and second data elements in the text data, the distance between them, and/or the degree of match or similarity (correlation) between those data elements and keywords associated in advance with the classification information.
The data evaluation unit 114 can also evaluate the relevance between the classification information and each sentence in the text represented by the text data contained in the data, and evaluate the relevance between the data and the classification information based on that result. For example, the data evaluation unit 114 evaluates relevance according to the degree of match or similarity between the words in a sentence and keywords associated in advance with the classification information. The data evaluation unit 114 supplies information indicating the evaluation result to the data re-extraction unit 116 and the reference determination unit 120.
(Data re-extraction unit 116)
The data re-extraction unit (data selection unit) 116 re-extracts (selects) a predetermined number of data items from the data whose relevance has been evaluated. The data re-extraction unit 116 supplies the extracted data to the data re-evaluation unit 118, the reference changing unit 122, and/or the parameter determination unit 124.
(Data re-evaluation unit 118)
The data re-evaluation unit 118 re-evaluates the relevance between the re-extracted data and the classification information, in the same manner as the data evaluation unit 114 described above. The data re-evaluation unit 118 can also evaluate relevance by calculating a score indicating the strength of the connection between the data and the classification information (that is, the predetermined case). For example, the data re-evaluation unit 118 compares the keywords associated in advance with the classification information against the text data contained in the data, calculates the degree of match, similarity, or the like, and evaluates the level of relevance according to the result.
Furthermore, like the data evaluation unit 114, the data re-evaluation unit 118 can evaluate relevance for each phase, a phase being an index indicating a stage in the progress of a predetermined act. The data re-evaluation unit 118 can also evaluate relevance using the data elements evaluated by the element evaluation unit 112, or based on the correlation between a first data element and a second data element contained in the data. It can also evaluate the relevance between the classification information and each sentence in the text represented by the text data contained in the data, and evaluate the relevance between the data and the classification information based on that result. The data re-evaluation unit 118 supplies information indicating the re-evaluation result to the reference changing unit 122 and the parameter determination unit 124.
(Reference determination unit 120)
The reference determination unit 120 determines whether the evaluation result from the data evaluation unit 114 satisfies a predetermined criterion (or a predetermined threshold). For example, the reference determination unit 120 determines whether the score corresponding to the evaluation result lies on a predetermined function curve and, if it does not, how far the score deviates from the curve. The reference determination unit 120 can also determine, phase by phase, whether a predetermined criterion set for each phase is satisfied. The reference determination unit 120 supplies information indicating the determination result to the reference changing unit 122.
(Reference changing unit 122, parameter determination unit 124)
The reference changing unit 122 changes the predetermined criterion used by the reference determination unit 120, based on the re-evaluation result from the data re-evaluation unit 118. For example, when the score corresponding to the evaluation result does not lie on the predetermined function curve, the reference changing unit 122 changes the predetermined criterion so that the reference determination unit 120 can determine that the score lies on the function curve.
The parameter determination unit 124 determines, based on the re-evaluation result, the parameters of a model assumed for the distribution of the re-evaluation results by regression analysis or the like (for example, a fitting curve calculated by regression analysis). The reference changing unit 122 changes the predetermined criterion using the model identified by the parameters determined by the parameter determination unit 124. Here, the model is a function belonging to the exponential family (for example, a function expressed as $y = e^{\alpha x + \beta}$, where e is the base of the natural logarithm and α and β are numbers greater than or equal to 0), and the parameter determination unit 124 determines the exponent parameter of the function (for example, the value of α in the function above).
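The regression method is not specified in the description. One simple possibility, assuming the example model is read as y = e^(αx+β) with strictly positive y values, is ordinary least squares on ln(y), which is linear in α and β. The function name and the sample points below are illustrative assumptions.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of ln(y) = alpha * x + beta.
    Returns (alpha, beta). Requires every y to be positive."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    alpha = sxl / sxx
    beta = mean_l - alpha * mean_x
    return alpha, beta

# Synthetic points generated from alpha = 0.5, beta = 0.2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x + 0.2) for x in xs]
alpha, beta = fit_exponential(xs, ys)
```

Because the sample points lie exactly on the assumed curve, the fit recovers the generating parameters up to floating-point error; real score distributions would be fitted only approximately.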
The reference changing unit 122 can also change, as the predetermined criterion, a threshold that is uniquely determined from a rank obtained by sorting the scores according to a predetermined rule such as ascending or descending order (for example, the higher the score, the higher the rank). One such threshold is the one corresponding to the rank, among the plurality of ranks, that is best suited to data extraction. Furthermore, the reference changing unit 122 can change each of the predetermined criteria set for the individual phases.
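A rank-derived threshold can be illustrated with the short sketch below, which sorts scores in descending order and takes the score at a chosen rank as the cut-off. How the “most suitable” rank is selected is not described, so the rank is simply passed in as an assumed parameter; the function names are hypothetical.

```python
def threshold_at_rank(scores, rank):
    """Score of the item at the given 1-based rank after a descending sort."""
    ordered = sorted(scores, reverse=True)
    if not 1 <= rank <= len(ordered):
        raise ValueError("rank out of range")
    return ordered[rank - 1]

def select(scores, threshold):
    """Scores that meet or exceed the threshold, in their original order."""
    return [s for s in scores if s >= threshold]

scores = [0.2, 3.1, 1.7, 0.9, 2.4]
threshold = threshold_at_rank(scores, 3)
selected = select(scores, threshold)
```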
(Data element storage unit 126)
The data element storage unit 126 stores, in the database 100, each data element extracted by the data element extraction unit 110 in association with the result of the evaluation of that data element by the element evaluation unit 112.
(Generation unit, relationship reflection unit)
The generation unit generates, for each piece of partial data contained in the data, appearance information indicating whether a predetermined data element appears in that partial data. More specifically, the generation unit generates, for each sentence or paragraph of the data (document), a keyword vector (appearance information) indicating whether a predetermined keyword (morpheme) is contained in that sentence or paragraph. Each element of the keyword vector takes the value “0” or “1” to indicate whether the keyword associated with that element is contained in the data. For example, when the second sentence or paragraph of the data contains the keyword “price”, the generation unit changes the element corresponding to “price” in the keyword vector for that sentence or paragraph from “0” to “1”.
The relationship reflection unit obtains relationship information for each piece of partial data by reflecting, in the appearance information generated by the generation unit, the relationship between a predetermined data element and other data elements different from it. More specifically, the relationship reflection unit multiplies each keyword vector generated by the generation unit by a correlation matrix, which indicates the correlation between a predetermined keyword and other keywords different from it, and thereby obtains a correlation vector 3 for each sentence as the relationship information. The correlation matrix is a square matrix whose elements express, for example, how likely another keyword (for example, “adjustment”) is to appear in a sentence in which the keyword “price” appears (that is, the correlation). The relationship reflection unit outputs the correlation vector 3 to the calculation unit 14.
The correlation matrix is optimized in advance using a training data set containing a predetermined number of predetermined document data. For example, for sentences in which the keyword “price” appears, the number of occurrences of each other keyword relative to that keyword is normalized to a value between 0 and 1 (that is, a maximum likelihood estimate) and stored in the corresponding element of the correlation matrix (so the elements in each column of the correlation matrix sum to 1). This allows the data analysis system 1 to calculate the correlation vectors optimally.
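One way to obtain a matrix with the column-sum property described above is to count sentence-level co-occurrences and divide each column by its total. The sketch below does this under two assumptions that the description does not confirm: a keyword is not counted against itself (the diagonal stays 0), and a keyword that never co-occurs with another gets an all-zero column. The training sentences are hypothetical.

```python
def correlation_matrix(sentences, vocabulary):
    """C[i][j]: share of keyword j's sentence-level co-occurrences that
    involve keyword i. Each non-empty column sums to 1."""
    index = {kw: k for k, kw in enumerate(vocabulary)}
    n = len(vocabulary)
    counts = [[0] * n for _ in range(n)]
    for tokens in sentences:
        present = [index[t] for t in set(tokens) if t in index]
        for j in present:
            for i in present:
                if i != j:  # assumption: only *other* keywords are counted
                    counts[i][j] += 1
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(counts[i][j] for i in range(n))
        if total:
            for i in range(n):
                matrix[i][j] = counts[i][j] / total
    return matrix

vocabulary = ["price", "adjust", "meeting"]
sentences = [
    ["price", "adjust"],
    ["price", "adjust", "meeting"],
    ["price", "meeting"],
    ["meeting"],
]
C = correlation_matrix(sentences, vocabulary)
column_sums = [sum(C[i][j] for i in range(3)) for j in range(3)]
```

In this toy corpus, “adjust” and “meeting” each account for half of the co-occurrences of “price”, so the first column is [0, 0.5, 0.5].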
The data evaluation unit 114 and/or the data re-evaluation unit 118 can evaluate the degree of relevance additionally based on the relationship information obtained by the relationship reflection unit using the correlation matrix. For example, the data evaluation unit 114 and/or the data re-evaluation unit 118 calculates, for each data item, a score indicating the degree of relevance between the data and the predetermined case, based on the sum of all the correlation vectors obtained by the relationship reflection unit. More specifically, instead of [Equation 1] above, the score can be calculated for each data item as the inner product of that sum (expressed as a column vector) and a weight vector W (expressed as a row vector) indicating the weights of the predetermined keywords, as shown in [Equation 2] below.

[Equation 2] (Figure JPOXMLDOC01-appb-M000002)

In [Equation 2], C denotes the correlation matrix and $s_s$ denotes the s-th keyword vector 2. TFnorm (the sum mentioned above) is calculated as shown in [Equation 3] below.

[Equation 3] (Figure JPOXMLDOC01-appb-M000003)

In [Equation 3], $TF_i$ denotes the term frequency of the i-th keyword, and $s_{js}$ denotes the j-th element of the s-th keyword vector 2.

Combining [Equation 2] and [Equation 3], the data evaluation unit 114 and/or the data re-evaluation unit 118 calculates the score for each data item by evaluating [Equation 4] below.

[Equation 4] (Figure JPOXMLDOC01-appb-M000004)

In [Equation 4], $w_i$ is the i-th element of the weight vector W.
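The equation images are not available in this text, so the following is only a sketch of the calculation as it is described in words: each per-sentence keyword vector is multiplied by the correlation matrix C, the resulting correlation vectors are summed, and the inner product with the weight vector W gives the score. Any normalization that [Equation 3] applies to the sum is not reproduced, and the matrix, weights, and vectors are hypothetical.

```python
def mat_vec(C, s):
    """Correlation vector: the matrix C applied to one keyword vector s."""
    return [sum(c_ij * s_j for c_ij, s_j in zip(row, s)) for row in C]

def correlation_score(sentence_vectors, C, W):
    """Inner product of W with the sum of all correlation vectors.
    Assumption: the plain sum is used, without further normalization."""
    total = [0.0] * len(W)
    for s in sentence_vectors:
        for i, value in enumerate(mat_vec(C, s)):
            total[i] += value
    return sum(w_i * t_i for w_i, t_i in zip(W, total))

# Hypothetical two-keyword setup ("price", "adjust").
C = [[0.0, 1.0],
     [1.0, 0.0]]
W = [1.0, 2.0]
sentence_vectors = [[1, 0], [1, 1]]  # one keyword vector per sentence

S = correlation_score(sentence_vectors, C, W)
```

The first sentence contains only “price”, yet through C it contributes to the “adjust” component, which is the effect the correlation matrix is meant to have.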
(Emotion extraction unit)
When the data contains at least a user's evaluation of an event (broadly, any occurrence that the user evaluates), the emotion extraction unit extracts from the data the emotion of the user who generated it, namely the emotion toward the event that arose from that evaluation. Consider the following example: a user rates the event “I read a certain novel” as “interesting”, forms a positive emotion of “liking” (for instance, the author's style) based on that rating, and writes the text (data) “It was very interesting. I think I will recommend it to my family” as a review of the novel in an e-mail or on a web page (for example, a page provided by a social networking service).
First, the emotion extraction unit determines whether the keywords contained in the text are stored in the database 100 as data elements. In the example above, when the data element “was interesting” is stored in the database 100 in advance in association with the positive value “+1.2” (an emotion rating), the emotion extraction unit takes “+1.2” as the result of extracting the emotion from the text. When the data element “will recommend” (an inflected form of “recommend”) is also stored in the database 100 in association with the positive value “+0.8” (an emotion rating), the emotion extraction unit takes “+2.0 (= +1.2 + 0.8)” as the result of extracting the emotion from the text.
The data evaluation unit 114 and/or the data re-evaluation unit 118 can evaluate the degree of relevance between the data and the predetermined case based on the emotion extraction result from the emotion extraction unit. For example, when the evaluation result from the data evaluation unit 114 is a score of “+10” and the emotion extraction result is “+2.0”, the data evaluation unit 114 can take “+20 (= 10 × 2)” as the evaluation result. This allows the data analysis system 1 to evaluate the relevance between the data and the predetermined case more accurately.
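The worked example above (+1.2 and +0.8 summing to +2.0, then 10 × 2 = 20) can be traced with a small sketch. It assumes the text has already been tokenized and reduced to dictionary forms, and that the emotion result is applied as a multiplier as in the example; the dictionary and function names are illustrative.

```python
# Emotion ratings taken from the example; the lookup table is hypothetical.
EMOTION_VALUES = {"interesting": 1.2, "recommend": 0.8}

def extract_emotion(tokens):
    """Sum of the emotion ratings of the tokens found in the table."""
    return sum(EMOTION_VALUES.get(t, 0.0) for t in tokens)

def adjusted_score(base_score, emotion):
    """Scale the relevance score by the extracted emotion value."""
    return base_score * emotion

tokens = ["very", "interesting", "recommend", "family"]
emotion = extract_emotion(tokens)
final_score = adjusted_score(10.0, emotion)
```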
(Concept extraction unit)
The concept extraction unit refers to the database 100, which stores predetermined patterns in association with the concepts they carry, and extracts from partial data (for example, a sentence or paragraph) containing such a pattern a superordinate concept that can summarize the partial data. For example, when an e-mail contains the sentence “sell an accounting system to Company A”, the concept extraction unit checks whether the keywords “accounting system” and “sell” are registered in the database 100. When “system” is registered in the database 100 as the superordinate concept of “accounting system” and “introduce” is registered as the superordinate concept of “sell”, the concept extraction unit extracts “introduce a system” as the superordinate concept of the sentence. This allows the data analysis system 1 to evaluate the relevance between the data and the predetermined case accurately, without being adversely affected by trivial (that is, non-essential) differences between keywords.
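A minimal sketch of the lookup in this example, assuming the pattern-to-concept table is a plain dictionary and patterns are matched as case-insensitive substrings (the description does not say how matching is done). Composing the concepts into a phrase such as “introduce a system” is left out; the sketch only returns the concepts in order of appearance.

```python
# Hypothetical pattern -> superordinate concept table, from the example.
SUPERORDINATE = {"accounting system": "system", "sell": "introduce"}

def extract_concepts(sentence):
    """Superordinate concepts of the registered patterns found in the
    sentence, ordered by where each pattern first appears."""
    text = sentence.lower()
    found = [(text.find(pattern), concept)
             for pattern, concept in SUPERORDINATE.items()
             if pattern in text]
    return [concept for _, concept in sorted(found)]

concepts = extract_concepts("Sell an accounting system to Company A")
```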
(Outline of data analysis method)
FIG. 2 shows an example of the processing flow of the data analysis system according to the embodiment of the present invention.
First, the data extraction unit 102 extracts a data group from the data set stored in the database 100 (step 10; hereinafter, "step" is abbreviated as "S"). The data extraction unit 102 supplies the extracted data group to the data classification unit 108. Next, the classification information receiving unit 106 receives classification information from the user via the input device 104 (S15) and supplies the received classification information to the data classification unit 108. The data classification unit 108 then associates the classification information received by the classification information receiving unit 106 with each of a predetermined number of data included in the data group (S20), and supplies information indicating the result of the association to the data evaluation unit 114.
The data evaluation unit 114 evaluates the relevance between each item of data in the data group included in the data set and the classification information (that is, the predetermined case) (S25, data evaluation step). The data evaluation unit 114 supplies information indicating the evaluation result to the data re-extraction unit 116 and the reference determination unit 120. The reference determination unit 120 analyzes the information indicating the evaluation result and determines whether the evaluation result satisfies a predetermined criterion (S30, reference determination step). Meanwhile, on receiving the information indicating the evaluation result, the data re-extraction unit 116 extracts a predetermined number of data again from the evaluated data (S35, data selection step) and supplies the extracted data to the data re-evaluation unit 118.
The data re-evaluation unit 118 re-evaluates the relevance between the data received from the data re-extraction unit 116 and the classification information (S40, data re-evaluation step), and supplies information indicating the evaluation result to the reference changing unit 122. The reference changing unit 122 analyzes the information indicating the result of the re-evaluation by the data re-evaluation unit 118 and changes the predetermined criterion (S45, reference changing step). The reference changing unit 122 then causes the reference determination unit 120 to perform its determination using the changed criterion.
FIG. 3 shows an example of the hardware configuration of the data analysis system according to the embodiment of the present invention.
The data analysis system 1 according to the present embodiment includes a CPU 1500; a graphic controller 1520; a memory 1530 such as a random access memory (RAM), a read-only memory (ROM), and/or a flash ROM; a storage device 1540 that stores data; a read/write device 1545 that reads data from and/or writes data to a recording medium; an input device 1560 for entering data; a communication interface 1550 that exchanges data with external communication equipment; and a chipset 1510 that connects the CPU 1500, the graphic controller 1520, the memory 1530, the storage device 1540, the read/write device 1545, the input device 1560, and the communication interface 1550 so that they can communicate with one another.
The chipset 1510 interconnects the memory 1530, the CPU 1500, which accesses the memory 1530 and executes predetermined processing, and the graphic controller 1520, which controls display on an external display device, and thereby passes data between these components. The CPU 1500 operates according to a program stored in the memory 1530 and controls each component. The graphic controller 1520 causes a predetermined display device to display an image based on image data temporarily held in a buffer provided in the memory 1530.
The chipset 1510 also connects the storage device 1540, the read/write device 1545, and the communication interface 1550. The storage device 1540 stores the programs and data used by the CPU 1500 of the data analysis system 1. The storage device 1540 is, for example, a flash memory. The read/write device 1545 reads a program and/or data from a storage medium that stores them and stores what it has read in the storage device 1540. For example, the read/write device 1545 acquires a predetermined program from a server on the Internet via the communication interface 1550 and stores the acquired program in the storage device 1540.
The communication interface 1550 exchanges data with external devices via a communication network. When the communication network is unavailable, the communication interface 1550 can also exchange data with external devices without going through the communication network. The input device 1560, such as a keyboard, a tablet, or a mouse, is connected to the chipset 1510 via a predetermined interface.
The data analysis program for the data analysis system 1 that is stored in the storage device 1540 is provided to the storage device 1540 via a communication network such as the Internet, or via a recording medium such as a magnetic recording medium or an optical recording medium. The program stored in the storage device 1540 is then executed by the CPU 1500.
The data analysis program executed by the data analysis system 1 according to the present embodiment acts on the CPU 1500 and causes the data analysis system 1 to function as the database 100, the data extraction unit 102, the input device 104, the classification information receiving unit 106, the data classification unit 108, the data element extraction unit 110, the element evaluation unit 112, the data evaluation unit 114, the data re-extraction unit 116, the data re-evaluation unit 118, the reference determination unit 120, the reference changing unit 122, the parameter determination unit 124, and the data element storage unit 126 described with reference to FIGS. 1 and 2.
(Effect of the embodiment)
The data analysis system 1 according to the present embodiment repeats the extraction of data, the evaluation of the relevance between the data and the classification information, and the determination of whether the evaluation result satisfies a predetermined criterion. This improves the accuracy with which data relevant to the classification information is extracted, and therefore prevents deterioration in learning accuracy.
(Example)
FIGS. 4, 6, and 8 each show a fitting curve before the relevance between the data and the classification information is re-evaluated, and FIGS. 5, 7, 9, and 10 each show an example of a fitting curve after the re-evaluation. The figure showing the data of FIG. 10 before fitting is omitted. In each figure, the fluctuating graph indicated by the arrow is the fitting curve, and the other graph shows the raw data.
<Details of the method for changing the criterion>
(1) First, the user (system administrator) sets a threshold for the normalized rank in advance. For example, in FIG. 4 the threshold is 1.E-03 (= 0.001 = 0.1%).
(2) The data extraction unit 102, the classification information receiving unit 106, and the data classification unit 108 receive training data and learn keyword weights. A fitting curve is also determined for the training data (FIGS. 4, 6, and 8).
(3) The data analysis system 1 identifies the document score on the fitting curve that corresponds to the threshold. For example, in FIG. 4 the document score corresponding to 1.E-03 is 0.68 for stage 1, 0.5 for stage 2, and 0.36 for stage 3.
(4) The data analysis system 1 sets these three document scores as the score thresholds of the respective stages.
(5) The data evaluation unit 114 uses predictive coding to calculate, for each document, three scores corresponding to the three stages (that is, three sets of keyword weights are stored in the database. For example, in the "relationship building phase" (phase 1) the weights of keywords such as "schedule" and "adjustment" may be larger than in the "execution phase" (phase 3), and in the "preparation phase" (phase 2) the weights of keywords such as "competing product" and "survey" may be larger than in the "relationship building phase" (phase 1). Different keywords may also be stored for each stage).
(6) The reference determination unit 120 determines whether each of the three calculated scores exceeds the corresponding score threshold.
(7) When, for example, the reference determination unit 120 determines that the score corresponding to phase 2 exceeds its score threshold, it judges that "the matter is currently likely to be in phase 2" and raises an alert to the system administrator.
(8) The data re-extraction unit 116 or the system administrator refers to the calculated scores and adjusts the learning of the system. For example, when review of a document that was given a high score shows that it is "unrelated to the lawsuit", the document is tagged "Non-Responsive". Conversely, when review of a document that was given a low score shows that it is "related to the lawsuit", the document is tagged "Responsive". This information is fed back to the system, and the data analysis system 1, for example, raises or lowers the weights of the keywords contained in the document, adds previously unregistered keywords and their weights to the database, or deletes unnecessary keywords from the database.
(9) The data re-evaluation unit 118 recalculates the scores for each phase and fits the exponential function again (FIGS. 5, 7, 9, and 10).
(10) The reference changing unit 122 resets the score thresholds by the same processing as in (3) and (4) above.
(11) The data analysis system 1 repeats (5) to (10) above as necessary.
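Steps (3), (4), (6), (7), and (10) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the curve form `rank = a * exp(-b * score)` with `b > 0` is assumed (the text says only that an exponential function is fitted and that the 0.1% rank threshold maps to the highest scores), and the function names and sample document scores are hypothetical. The stage thresholds 0.68, 0.5, and 0.36 are the values the text reads from FIG. 4.

```python
import math


def score_threshold(a: float, b: float, rank_threshold: float) -> float:
    """Invert the assumed curve rank = a * exp(-b * score) at a given normalized rank."""
    if a <= 0 or b <= 0 or rank_threshold <= 0:
        raise ValueError("a, b and rank_threshold must all be positive")
    return math.log(a / rank_threshold) / b


# Score thresholds per stage at normalized rank 1.E-03, as read from FIG. 4 in the text.
STAGE_THRESHOLDS = {1: 0.68, 2: 0.50, 3: 0.36}


def stages_over_threshold(doc_scores, thresholds=STAGE_THRESHOLDS):
    """Return the stages whose score for one document exceeds that stage's threshold."""
    return [stage for stage in sorted(doc_scores) if doc_scores[stage] > thresholds[stage]]


# Hypothetical per-stage scores for one document: only stage 2 triggers an alert.
print(stages_over_threshold({1: 0.40, 2: 0.55, 3: 0.20}))  # [2]
```

After a refit in step (9) yields new values of `a` and `b`, calling `score_threshold` again with the same rank threshold gives the reset score threshold of step (10).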
Through the above flow, the data analysis system 1 can (a) prevent deterioration in learning accuracy and (b) automatically increase the variety of keywords while maintaining the accuracy of its determinations. In other words, the data analysis system 1 can adapt automatically to its operating environment (that is, automatic environment adaptation can be realized).
FIGS. 4, 6, and 8 each show the result of analyzing a data set consisting of a plurality of predetermined, mutually different data. In each figure, a fitting curve was calculated for each predetermined phase (hereinafter called a "stage" in the examples). That is, in the examples a fitting curve was calculated for each of a predetermined first phase (stage 1), second phase (stage 2), and third phase (stage 3). In each figure, the horizontal axis shows the document score and the vertical axis shows the normalized rank on a logarithmic scale (that is, the rank when the document scores are arranged in ascending order). A fitting curve based on an exponential function therefore appears as a straight line in the figures; ranks are higher toward the bottom of the vertical axis and lower toward the top.
FIGS. 5, 7, 9, and 10, on the other hand, each show the fitting curve after the exponent parameter of the predetermined function has been changed to the parameter determined by the parameter determination unit 124 following the re-evaluation processing in the data re-evaluation unit 118. As the figures show, with the data analysis system 1 according to the present embodiment the coefficient of determination R² takes a value very close to 1, which indicates that the fitting curve after the re-evaluation processing fits the experimental data accurately.
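Since an exponential curve is a straight line on a logarithmic rank axis, the fit and its coefficient of determination can be sketched with ordinary least squares on the logarithm of the rank. The function name `fit_exponential`, the curve form `rank = a * exp(-b * score)`, and the choice to compute R² in log space are all assumptions made for this illustration; the text does not say how the fit or R² was actually computed. The sample data below are synthetic, not taken from the figures.

```python
import math


def fit_exponential(scores, ranks):
    """Least-squares fit of ln(rank) = ln(a) - b * score.

    Returns (a, b, r2), where r2 is the coefficient of determination
    computed on ln(rank), i.e. on the straight line seen in a log-scale plot.
    """
    ys = [math.log(r) for r in ranks]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(scores, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot
    return math.exp(intercept), -slope, r2


# Synthetic data lying exactly on rank = 0.5 * exp(-8 * score).
scores = [0.1, 0.2, 0.3, 0.4, 0.5]
ranks = [0.5 * math.exp(-8.0 * s) for s in scores]
a, b, r2 = fit_exponential(scores, ranks)
print(a, b, r2)
```

With real scores the points scatter around the line, and an R² close to 1 corresponds to the close fit reported for FIGS. 5, 7, 9, and 10.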
[Example of application to data other than documents]
The data analysis system 1 can also be applied to data other than documents. That is, the data analysis system 1 can also analyze data other than text. For example, when the data analysis system 1 analyzes speech, it may (1) recognize the speech, convert the content of the conversation contained in it into characters (text), and analyze that text, or (2) analyze the speech data as it is.
In case (1) above, the data analysis system 1 converts the speech into text using any speech recognition algorithm (for example, a recognition method based on a hidden Markov model) and performs on that text the same processing as described above. The data analysis system 1 can thereby analyze speech.
In case (2) above, the data analysis system 1 extracts partial speech (data elements) contained in the speech. For example, when the speech "adjust the price" is obtained, the data analysis system 1 extracts the partial speech "price" and "adjust" from it and, based on the result of evaluating that partial speech, can evaluate the relevance between unclassified speech (unknown data) and the classification information (the predetermined case). In this case, the data analysis system 1 can classify the speech using a classification algorithm for time-series data (for example, a hidden Markov model, a Kalman filter, or a neural network). The data analysis system 1 can thereby analyze speech.
The data analysis system 1 can also analyze video (moving images). In this case, the data analysis system 1 extracts frame images contained in the video and can identify a person appearing in a frame image by using any face recognition technique. In addition, by using any motion recognition technique (which may, for example, be an application of pattern matching), the data analysis system 1 can extract the motion of that person from a partial video contained in the video (a video made up of some of the frame images contained in the video). The data analysis system 1 can then evaluate the relevance between unclassified video (unknown data) and the classification information (the predetermined case) based on the person and/or the motion. The data analysis system 1 can thereby analyze video.
[Application examples other than forensic systems]
The data analysis system 1 can be applied not only to a forensic system (a system that extracts lawsuit-related documents) but also to the systems described below. For example, it can be applied to a medical application system (a system that predicts risks such as illness using electronic medical records, nursing records, and the like as data). In this case, the medical application system evaluates the relevance between data (for example, electronic medical records or nursing records) and a predetermined case (for example, that a drug was effective for a patient, or that a patient began to recover after a doctor's diagnosis), and can thereby, for example, objectively evaluate the effect of the drug or apply a skilled doctor's diagnosis to other patients. The medical application system can evaluate the relevance for each phase, such as a follow-up phase (phase 1), a remission phase (phase 2), a complete cure phase (phase 3), and a recurrence phase (phase 4).
The data analysis system 1 can also be applied to an email audit system. In this case, the email audit system evaluates the relevance between data (for example, the emails that circulate on a network every day) and a predetermined case (for example, that an email is attempting to leak the organization's confidential information, or that an email is attempting to propose wrongdoing such as bid rigging to another organization), and can thereby, for example, prevent incidents that would harm the organization (for example, information leaks or wrongdoing) before they occur.
The data analysis system 1 can also be applied to an Internet application system. In this case, the Internet application system evaluates the relevance between data (for example, messages that a user has posted to a social networking service, recommendations published on websites, or the profiles of users or groups) and a predetermined case (for example, that the user's preferences are similar to those of another user, or that the user's preferences match the attributes of a restaurant), and can thereby, for example, display a list of other users likely to get along with the user, present information on restaurants that suit the user's preferences, or warn of groups that could harm the user. The Internet application system (data analysis system 1) can thereby improve the convenience of the Internet.
The data analysis system 1 can also be applied to a driving support system. In this case, the driving support system evaluates the relevance between data (for example, data acquired from in-vehicle sensors, cameras, and microphones) and a predetermined case (for example, information that a skilled driver paid attention to while driving), and can thereby, for example, automatically extract useful information that can make driving safer and more comfortable.
The data analysis system 1 can also be applied to a finance-related system. In this case, the finance-related system evaluates the relevance between data (for example, documents filed with a bank or current stock prices) and a predetermined case (for example, that there is a risk of an improper purpose, or that a stock price will rise), and can thereby, for example, detect filings made for an improper purpose or predict future stock prices.
Furthermore, the data analysis system 1 can be applied to a performance evaluation system. In this case, the performance evaluation system evaluates the relevance between data (for example, daily reports that sales staff submit to their company, or analysis materials that a consultant submits to a customer) and a predetermined case (for example, that the sales staff member achieves sales results, or that the consultant is highly regarded by customers), and can thereby, for example, carry out personnel evaluations of sales staff and consultants or evaluate the success or failure of a project.
In this way, the data analysis system of the present invention can be applied to any system that achieves its purpose by evaluating the relevance between data and a predetermined case, such as a forensic system, a discovery support system, a medical application system, an Internet application system, a driving support system, a finance-related system, or a performance evaluation system.
The embodiment and examples of the present invention have been described above, but they do not limit the invention according to the claims. It should also be noted that not all of the combinations of features described in the embodiment and examples are necessarily essential to the means for solving the problem addressed by the invention. Furthermore, the technical elements of the embodiment and examples described above may be applied individually, or may be applied after being divided into a plurality of parts, such as program components and hardware components.
DESCRIPTION OF SYMBOLS
 1 Data analysis system
 100 Database
 102 Data extraction unit
 104 Input device
 106 Classification information receiving unit
 108 Data classification unit
 110 Data element extraction unit
 112 Element evaluation unit
 114 Data evaluation unit
 116 Data re-extraction unit (data selection unit)
 118 Data re-evaluation unit
 120 Reference determination unit
 122 Reference changing unit
 124 Parameter determination unit
 126 Data element storage unit
 1500 CPU
 1510 Chipset
 1520 Graphic controller
 1530 Memory
 1540 Storage device
 1545 Read/write device
 1550 Communication interface
 1560 Input device

Claims (16)

  1.  A data analysis system that evaluates the relevance between unknown data and a predetermined case by analyzing whether a pattern extracted from training data is included in the unknown data, the data analysis system comprising:
     a data evaluation unit that evaluates the relevance based on the extracted pattern;
     a reference determination unit that determines whether an evaluation result of the data evaluation unit satisfies a predetermined criterion;
     a data selection unit that selects a predetermined number of data from the unknown data whose relevance has been evaluated by the data evaluation unit;
     a data re-evaluation unit that re-evaluates the relevance between the data selected by the data selection unit and the predetermined case based on a pattern included in the selected data; and
     a reference changing unit that changes, based on a re-evaluation result of the data re-evaluation unit, the predetermined criterion used in the reference determination unit.
  2.  The data analysis system according to claim 1, further comprising a parameter determination unit that determines, based on the re-evaluation result, a parameter of a model assumed for the distribution of the re-evaluation results,
     wherein the reference changing unit changes the predetermined criterion using the model identified by the determined parameter.
  3.  The data analysis system according to claim 2, wherein the model is a function belonging to the exponential family of distributions, and
     the parameter determination unit determines an exponent parameter of the function.
  4.  The data analysis system according to any one of claims 1 to 3, wherein
     the data re-evaluation unit evaluates the relationship by calculating a score indicating the strength of the connection between the selected data and the predetermined case,
     the predetermined criterion is based on whether a threshold set for the score is exceeded, and
     the reference changing unit changes the predetermined criterion by changing the threshold to another threshold that can be uniquely identified from a rank determined by sorting the scores according to a predetermined rule.
  5.  The data analysis system according to any one of claims 1 to 4, further comprising:
     a generation unit that generates, for each piece of partial data included in data, appearance information indicating whether a predetermined data element appears in that partial data; and
     a relationship reflection unit that obtains relationship information for each piece of partial data by reflecting, in the appearance information generated by the generation unit, the relationship between the predetermined data element and another data element different from the predetermined data element,
     wherein the data evaluation unit and/or the data re-evaluation unit evaluates the degree of relevance further based on the relationship information obtained by the relationship reflection unit.
  6.  The data analysis system according to any one of claims 1 to 5, wherein the data includes at least a user's evaluation of an event,
     the data analysis system further comprising an emotion extraction unit that extracts, from the data, an emotion of the user who generated the data, the emotion being directed at the event and arising from the evaluation,
     wherein the data evaluation unit and/or the data re-evaluation unit evaluates the degree of relevance further based on the extraction result of the emotion extraction unit.
  7.  The data analysis system according to any one of claims 1 to 6, further comprising a concept extraction unit that extracts, from the partial data including the predetermined pattern, a superordinate concept capable of summarizing that partial data, by referring to a database that stores the predetermined pattern in association with a concept that the predetermined pattern has.
  8.  The data analysis system according to any one of claims 1 to 7, further comprising:
     a data extraction unit that extracts a predetermined number of pieces of data, as the training data, from a data set stored in a database;
     a classification information receiving unit that receives, from a user via a predetermined input device, classification information indicating a classification of the training data; and
     a data classification unit that classifies the training data by associating the received classification information with the training data,
     wherein the data analysis system extracts the pattern from the training data based on a result of classifying the training data.
  9.  The data analysis system according to any one of claims 1 to 8, wherein
     the data evaluation unit and/or the data re-evaluation unit evaluates the relationship for each phase, a phase being an index indicating each stage through which a predetermined action progresses,
     the reference determination unit determines, for each phase, whether or not the predetermined reference provided for that phase is satisfied, and
     the reference changing unit changes each of the predetermined references provided for the respective phases.
  10.  The data analysis system according to any one of claims 1 to 5, further comprising:
     a data element extraction unit that extracts data elements from the data classified by the data classification unit; and
     an element evaluation unit that evaluates the data elements according to a predetermined element criterion,
     wherein the data evaluation unit and/or the data re-evaluation unit evaluates the relevance by using the data elements evaluated by the element evaluation unit.
  11.  The data analysis system according to claim 10, wherein the element evaluation unit evaluates the data element by using, as one of the predetermined element criteria, a transmitted information amount representing a dependency between the data element and the classification information associated with data that includes the data element.
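The "transmitted information amount" of claim 11 is read here as the mutual information between the presence of a data element and the classification label; that reading is an assumption. The sketch below computes it in bits from labelled documents represented as sets of tokens. The function name `mutual_information` and the toy data are illustrative only.

```python
import math


def mutual_information(docs, labels, element):
    """Mutual information (bits) between element presence and the class label.

    docs   : list of sets of tokens (one set per document)
    labels : list of classification labels, same length as docs
    """
    n = len(docs)
    joint = {}
    for doc, label in zip(docs, labels):
        key = (element in doc, label)
        joint[key] = joint.get(key, 0) + 1

    mi = 0.0
    for (present, label), count in joint.items():
        p_xy = count / n
        # Marginal probabilities obtained by summing the joint counts.
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == label) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


docs = [{"cartel", "price"}, {"cartel"}, {"lunch"}, {"lunch", "price"}]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
mi_cartel = mutual_information(docs, labels, "cartel")
mi_price = mutual_information(docs, labels, "price")
```

In this toy data "cartel" appears only in relevant documents, so it carries a full bit of information about the label, while "price" is spread evenly across both classes and carries none.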
  12.  The data analysis system according to claim 10 or 11, further comprising a data element storage unit that stores, in the database, the data element extracted by the data element extraction unit in association with the result of the evaluation of that data element by the element evaluation unit.
  13.  The data analysis system according to any one of claims 1 to 12, wherein the data evaluation unit and/or the data re-evaluation unit evaluates the relationship based on a correlation between a first data element and a second data element included in the data.
  14.  The data analysis system according to any one of claims 1 to 13, wherein the data includes text at least in part, and
     the data evaluation unit and/or the data re-evaluation unit evaluates the relevance between a sentence or paragraph included in the text and the predetermined case and, based on a result of that evaluation, evaluates the relevance between the data and the predetermined case.
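The sentence-level evaluation of claim 14 can be sketched as follows. The sketch assumes that each sentence is scored by summing keyword weights and that the document takes the highest sentence score; both the scoring rule and the max aggregation are illustrative assumptions, since the claim does not fix how sentence results are combined. The names `sentence_scores` and `document_score` are hypothetical.

```python
import re


def sentence_scores(text, weights):
    """Split text into sentences and score each by the keyword weights it contains."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        (s, sum(w for keyword, w in weights.items() if keyword in s.lower()))
        for s in sentences
    ]


def document_score(text, weights):
    """Document relevance taken as the best sentence score (an assumed aggregation)."""
    scores = [score for _, score in sentence_scores(text, weights)]
    return max(scores, default=0.0)


weights = {"cartel": 2.0, "price": 1.0}
text = "We met for lunch. The cartel agreed on price."
per_sentence = sentence_scores(text, weights)
doc_score = document_score(text, weights)
```

Scoring at sentence granularity lets a single strongly relevant sentence surface a document that is otherwise unremarkable.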
  15.  A method of controlling a data analysis system that evaluates the relevance between unknown data and a predetermined case by analyzing whether or not a pattern extracted from training data is included in the unknown data, the method comprising:
     a data evaluation step of evaluating the relevance based on the extracted pattern;
     a criterion determination step of determining whether or not an evaluation result in the data evaluation step satisfies a predetermined criterion;
     a data selection step of selecting a predetermined number of pieces of data from the unknown data whose relevance was evaluated in the data evaluation step;
     a data re-evaluation step of re-evaluating the relevance between the data selected in the data selection step and the predetermined case, based on a pattern included in the selected data; and
     a criterion changing step of changing the predetermined criterion based on a re-evaluation result in the data re-evaluation step.
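The five steps of claim 15 can be strung together in a minimal sketch. It assumes that patterns are weighted keywords, that the criterion is a score threshold, that the selected data are the top-scoring documents, and that the threshold is moved to the lowest score among the selected documents confirmed on re-evaluation. All of these choices, along with the names `score`, `run_once`, and the `reevaluate` callback, are illustrative assumptions rather than the patented procedure.

```python
def score(doc, weights):
    """Data evaluation: sum the weights of the patterns found in the document."""
    return sum(w for pattern, w in weights.items() if pattern in doc)


def run_once(unknown_docs, weights, threshold, sample_size, reevaluate):
    # Data evaluation step: score every unknown document.
    scored = [(score(doc, weights), doc) for doc in unknown_docs]

    # Criterion determination step: which results satisfy the current threshold?
    passed = [doc for s, doc in scored if s >= threshold]

    # Data selection step: pick a predetermined number of documents
    # (here, the highest-scoring ones; an assumed selection rule).
    selected = sorted(scored, key=lambda pair: pair[0], reverse=True)[:sample_size]

    # Data re-evaluation step: `reevaluate` is a hypothetical stand-in that
    # returns True when the selected document is judged relevant.
    confirmed = [s for s, doc in selected if reevaluate(doc)]

    # Criterion changing step: assumed rule, move the threshold to the lowest
    # score that was still confirmed as relevant.
    new_threshold = min(confirmed) if confirmed else threshold
    return passed, new_threshold


weights = {"cartel": 2.0, "price": 1.0}
docs = ["cartel price talk", "price list", "lunch menu"]
passed, new_threshold = run_once(
    docs, weights, threshold=2.5, sample_size=2,
    reevaluate=lambda doc: "price" in doc,
)
```

In this run the initial threshold admits only the first document; re-evaluation confirms a lower-scoring document as relevant, so the threshold is lowered for the next pass.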
  16.  A control program for a data analysis system that evaluates the relevance between unknown data and a predetermined case by analyzing whether or not a pattern extracted from training data is included in the unknown data, the control program causing a computer to realize:
     a data evaluation function of evaluating the relevance based on the extracted pattern;
     a criterion determination function of determining whether or not an evaluation result by the data evaluation function satisfies a predetermined criterion;
     a data selection function of selecting a predetermined number of pieces of data from the unknown data whose relevance was evaluated by the data evaluation function;
     a data re-evaluation function of re-evaluating the relevance between the data selected by the data selection function and the predetermined case, based on a pattern included in the selected data; and
     a criterion changing function of changing the predetermined criterion based on a re-evaluation result by the data re-evaluation function.
PCT/JP2015/050517 2015-01-09 2015-01-09 Data analysis system, data analysis system control method, and data analysis system control program WO2016111007A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/050517 WO2016111007A1 (en) 2015-01-09 2015-01-09 Data analysis system, data analysis system control method, and data analysis system control program

Publications (1)

Publication Number Publication Date
WO2016111007A1 (en) 2016-07-14

Family

ID=56355727

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/050517 WO2016111007A1 (en) 2015-01-09 2015-01-09 Data analysis system, data analysis system control method, and data analysis system control program

Country Status (1)

Country Link
WO (1) WO2016111007A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558966A (en) * 2018-10-28 2019-04-02 西南电子技术研究所(中国电子科技集团公司第十研究所) Intelligence sentences the processing system that card predicted events occur

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002202983A (en) * 2000-12-28 2002-07-19 Matsushita Electric Ind Co Ltd Method and device for preparing calculation standard of attribution degree to classification
JP2004070636A (en) * 2002-08-06 2004-03-04 Mitsubishi Electric Corp Concept searching device
JP2004514220A (en) * 2000-11-15 2004-05-13 株式会社ジャストシステム Method and apparatus for analyzing emotions and emotions in text
JP2011227688A (en) * 2010-04-20 2011-11-10 Univ Of Tokyo Method and device for extracting relation between two entities in text corpus
JP2013182338A (en) * 2012-02-29 2013-09-12 Ubic:Kk Document classification system and document classification method and document classification program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15876882

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: JP

122 Ep: pct application non-entry in european phase

Ref document number: 15876882

Country of ref document: EP

Kind code of ref document: A1