WO2021088499A1 - A method and system for identifying false invoice issuance based on dynamic network representation - Google Patents

A method and system for identifying false invoice issuance based on dynamic network representation

Info

Publication number
WO2021088499A1
Authority
WO
WIPO (PCT)
Prior art keywords
enterprise
network
day
representation
characterization
Prior art date
Application number
PCT/CN2020/113450
Other languages
English (en)
French (fr)
Inventor
郑庆华
董博
阮建飞
范弘铖
Original Assignee
西安交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西安交通大学
Publication of WO2021088499A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Definitions

  • The invention belongs to the technical field of tax control and relates in particular to a method and system for identifying false invoice issuance based on dynamic network representation.
  • False invoice issuance refers to enterprises using various means to issue invoices that do not match their actual business activity, in order to evade taxes.
  • Network representation technology provides a solution.
  • A method of identifying false invoice issuance based on network representation can organize isolated statement information into an enterprise transaction network and thereby check all enterprises systematically; it can also use the links between enterprises to obtain more enterprise information for identifying false-invoicing enterprises.
  • The following patents provide reference methods that use network representation technology to identify false invoices automatically by computer:
  • Document 1. A method for detecting falsely issued special VAT invoices based on parallel loop detection (201710147850.8);
  • Document 2. A method for identifying suspicious taxpayers based on the taxpayer interest-associated network (201410328391.X);
  • Document 1 organizes invoice information into a static network with enterprises as nodes and improves loop detection in that network: a distributed parallel computing method assigns computing tasks to multiple computers in a distributed cluster to improve efficiency, and the improved loop detection is then used to detect falsely issued special VAT invoices.
  • Document 2 identifies suspicious taxpayers from the topological features of the taxpayer interest-associated network (TPIN): it analyzes the topological features of the network to obtain each taxpayer's representation in it, and then runs experiments with a C4.5 classifier, thereby identifying suspicious taxpayers automatically.
  • Document 1 can only detect false invoicing in which funds return to the source account after passing through multiple accounts; false invoicing takes many forms and is not limited to loops, so the method identifies too narrow a range of types and generalizes poorly;
  • Document 2 relies only on the topology of taxpayers and interest relationships and ignores enterprise attribute information; it treats all enterprises alike and cannot analyze them in terms of scale, market share, and so on;
  • Documents 1 and 2 are both limited to static networks: they cannot combine historical information to analyze changes in enterprise transactions dynamically or track those changes accurately, which leaves loopholes for some enterprises.
  • The purpose of the invention is to provide a method and system for identifying false invoice issuance based on dynamic network representation.
  • The invention adopts dynamic network representation, analyzes the enterprise transaction network dynamically in combination with historical information, and accurately tracks the dynamic changes in enterprise transactions; based on the association information between enterprises, it can identify different kinds of false invoicing; and, drawing on distributed optimization algorithms, it decomposes the computation function into independent sub-functions executed in parallel, which improves the efficiency of identifying false invoices.
  • A method for identifying false invoice issuance based on dynamic network representation: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is built with each day as a time node, a time-series window of 30 days is established, 30 days of static network representations are fused within the window each time, and the static network representations of all time nodes are fused step by step by moving the window to obtain the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is built on LightGBM to identify enterprises suspected of false invoicing.
  • The method specifically includes the following implementation steps:
  • The data is first preprocessed, and the basic enterprise information is then extracted. Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
  • Step 2 Feature extraction based on dynamic network representation
  • After the basic enterprise features are extracted, enterprise transaction information is organized into a static network with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node. A time-series window is then established in units of 30 days; 30 days of static network representations are fused within the window each time, the static network representations of all times are fused step by step by moving the window, the objective function of the network representation is optimized, and the optimal dynamic enterprise transaction network representation is finally obtained
  • Step 3 Distributed algorithm optimization
  • Step 4 Build a classifier to identify false invoices
  • Step 1 is implemented as follows:
  • Step 101 Data preprocessing
  • Step 102 Processing text data
  • Processing the text information in the basic enterprise information table includes:
  • Step 103 Processing flag-type (categorical) data
  • Discrete categorical data in the basic enterprise information table is One-Hot encoded: status bits whose length equals the number of values the attribute can take mark each specific state;
  • Step 104 Processing numerical data
  • Step 2 is implemented as follows:
  • Step 201 Establish a static corporate transaction network
  • A representation model of the enterprise transaction network is built for every day, so that enterprises with similar topology or higher transaction weights lie closer together in the representation space.
  • The objective function is:
  • $h_i$ and $h_j$ are the representations of enterprises $i$ and $j$;
  • $w_{ij}$ is the weight of the transactions between the enterprises; minimizing $w_{ij}\|h_i-h_j\|^2$ forces the representations of enterprises $i$ and $j$ closer together the larger the transaction weight $w_{ij}$ is
  • Step 202 Dynamically fuse historical information
  • ρ is a parameter that defines the structural properties of the model and how much the approximation of the original matrix contributes. The larger ρ is, the more the model emphasizes the time-series network representation; the smaller it is, the more it emphasizes node representation
  • Step 3 is implemented as follows:
  • Step 301 Decompose the objective function
  • Step 302 execute multiple sub-functions in parallel
  • Step 303 comprehensively sort the parallel results
  • Step 4 is implemented as follows:
  • Step 401 Combine the basic features obtained in step 1 and the dynamic network features obtained in step 3 as the learning data of the classifier;
  • Step 402 Build a binary classification model on LightGBM and set the main classifier parameters as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
  • Step 403 Take the representations obtained for the sample set of enterprises labeled as false invoicers and for the sample set of normal enterprises as basic features, and split them randomly at a ratio of 3:1 into a training set and a test set; then randomly set aside 10% of the training set as a validation set. Train the classification model of step 402 on the training set and use the validation set to tune training; if over-fitting occurs, prune. Select the best model and verify the accuracy of the algorithm on the test set;
  • Step 404 Input the representations of the unlabeled enterprise samples into the LightGBM-based model for predicting enterprises suspected of false invoicing, and determine from the model's output whether the target enterprise has engaged in false invoicing.
  • Compared with the prior art, the invention has the following beneficial effects:
  • The invention is a method for identifying enterprises suspected of false invoicing based on the idea of dynamic network representation learning, and has the following advantages:
  • The computation function is decomposed into independent sub-functions executed in parallel, which reduces the time complexity of computing the network representation and improves the efficiency of identifying false invoices.
  • Figure 1 is the overall framework flow chart
  • Figure 2 is a schematic diagram of the basic feature extraction process
  • Figure 3 is a schematic diagram of a feature extraction process based on dynamic network representation
  • Figure 4 is a schematic diagram of the optimization process of the network characterization algorithm
  • Figure 5 is a schematic diagram of the process of constructing a classifier to identify false invoices
  • Fig. 6 is a schematic diagram of a system for identifying false invoice issuance based on dynamic network representation according to an embodiment of the present invention.
  • the method for identifying false invoices based on dynamic network representation includes the following steps:
  • Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized.
  • the basic feature extraction implementation process specifically includes the following steps:
  • Step 1 Extract the "Taxpayer Electronic File Number" as the unique identifier of the company's characteristics, and delete all other attributes that cannot describe the company's own distribution rules;
  • Step 2 When an attribute has a large number of missing values and only very few valid ones (for example, fewer than 10% of enterprises have values for "taxpayer tax authority code", "financial statement type" and "accounting form"), the feature is deleted outright; when an attribute has a small number of missing values (for example, "number of employees" and "registered capital" are missing for a few enterprises), the missing values are completed by same-class mean imputation.
  • Step 1 Segment the text with the Jieba word segmentation tool, build a suitable stop-word list, and remove the stop words from the text.
  • For example, the "business scope" field of one enterprise in this embodiment reads "production, sales: ceramics and products; import and export of goods, import and export of technology".
  • After segmentation and stop-word removal, the result is "production, sales, ceramics and products, import and export of goods, import and export of technology";
  • Step 2 Use the dictionary tree to count the results of step 1, and select words with larger weights as keywords;
  • Step 3 Convert the N types of keywords extracted in step 2 into vectors based on word2vec.
  • One-Hot encoding is used for the discrete categorical data "enterprise type" and "enterprise status" in the basic enterprise information table.
  • The number of possible values of the attribute gives the length of the status bits; one bit is set to 1 and all the others to 0 to indicate a specific state.
  • For example, the "enterprise type" field in this embodiment has four possible values: "sole proprietorship", "partnership", "limited liability company" and "joint-stock limited company". The status-bit length of "enterprise type" is therefore 4, where 1000 means "sole proprietorship", 0100 means "partnership", 0010 means "limited liability company", and 0001 means "joint-stock limited company".
  • Step 1 Obtain the mean value of the "registered capital” attribute
  • n represents the number of basic information samples of the enterprise
  • x j represents the value of the j-th "registered capital”attribute
  • Step 2 Get the variance of each attribute
  • Let $\sigma^2$ be the variance of the "registered capital" attribute; its specific form is:
  • Mean and variance are the basic indicators of numerical attributes, and numerical attributes can be standardized through the mean and variance;
  • Step 1 Establish a static corporate transaction network
  • The representation h of each enterprise on that day can then be obtained, such that enterprises with similar transaction structure or large transaction weights lie closer together in the representation space, and hence the representation of the whole enterprise transaction network for that day.
  • Step 2 Dynamically integrate historical information
  • The time-series window is 30 days long. Within the window, 30 days of static network representations are fused each time; the window is then moved so that all static network representations are fused step by step, minimizing the objective.
  • the specific steps of the distributed algorithm optimization implementation process include:
  • the gradient descent algorithm is used to solve equation (4).
  • Updating stops when either convergence condition is met; the representation at which the values are approximately equal is the representation of the enterprise transaction network for that day. For a dynamic transaction network distributed over days 1 to T, the network representation of each day can therefore be computed in order.
  • the basic feature vector of the enterprise obtained in S101 is directly placed after the dynamic network feature vector obtained in S103, and then combined into a new vector as the learning data of the classifier
  • the main parameters for setting the classifier are: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
  • Step 1 Take the characterization results obtained by the sample set of enterprises marked as false invoices and the sample set of normal enterprises as basic features, and randomly divide them into two groups as the training set and the test set at a ratio of 3:1.
  • Step 2 Randomly select 10% of the data in the training set as the validation set.
  • Step 3 Use the training set to train the classification model built by S502, use the validation set to adjust the training, and perform pruning when over-fitting occurs;
  • Step 4 Iterative calculation. Since the number of iterations is set to 100, if the convergence condition is not reached for 100 iterations, the iteration is forced to stop, and the result of the last iteration is the calculated representation.
  • Step 5 Select the optimal model to verify the accuracy of the algorithm in the test set.
  • In this embodiment the verified accuracy is 0.957, the precision is 0.921, and the recall is 0.87, indicating that the model performs very well on the test set and can meet the requirements for identifying false invoices in real tax scenarios.
  • By comparison, other false-invoice identification methods based on static network representation reach an accuracy of 0.876, a precision of 0.856, and a recall of 0.794.
  • The method of the invention improves accuracy by 9.25%, precision by 7.6%, and recall by 9.57%.
  • The running time of the distributed algorithm on the data sample in this embodiment is 684.57 s, which is 28.56% shorter than the 958.19 s of the non-distributed algorithm.
  • Input the representations of the unlabeled enterprise samples into the trained model for predicting enterprises suspected of false invoicing, and determine from the model's output whether the target enterprise has engaged in false invoicing. In this embodiment, the predicted values are sorted from high to low and the top ten percent are taken as enterprises suspected of false invoicing
  • a system for identifying false invoices based on dynamic network representation includes:
  • The enterprise attribute feature extraction module preprocesses the data and then extracts the basic enterprise information. Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
  • The dynamic network representation building module processes the enterprise attribute features, with each day as a time node, to obtain the enterprise's static transaction network representation; it then establishes a 30-day time-series window, fuses the static network representations within the window through a regularization term, and slides the window along the time series to fuse all static network representations step by step into the dynamic network representation;
  • The parallel optimization module for the dynamic network representation decomposes the objective of the enterprise dynamic network representation into independent sub-objectives; optimizing the sub-objectives in parallel makes the dynamic network representation more efficient and yields the final representation more quickly;
  • The false-invoice identification module takes the resulting enterprise dynamic network representation as the features of false-invoicing behavior, inputs it into the binary classifier built on LightGBM, and trains the identification model with the labeled enterprise sample set. The representations of the enterprise samples to be predicted are then input into the trained model, which outputs the enterprises suspected of issuing false invoices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Evolutionary Biology (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Strategic Management (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Finance (AREA)
  • Technology Law (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

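The invention discloses a method and system for identifying false invoice issuance based on dynamic network representation. First, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges. Second, a representation of the enterprise transaction network is built with each day as a time node, a time-series window of 30 days is established, 30 days of static network representations are fused within the window each time, and the static network representations of all time nodes are fused step by step by moving the window to obtain the final dynamic network representation. Third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, and optimizing the sub-functions in parallel improves the learning efficiency of the model. Finally, a binary classifier is built on LightGBM to identify enterprises suspected of false invoicing. By identifying suspect enterprises on the basis of dynamic network representation, the invention improves both the efficiency and the accuracy of false-invoice identification.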

Description

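A method and system for identifying false invoice issuance based on dynamic network representation
[Technical Field]
The invention belongs to the technical field of tax control and relates in particular to a method and system for identifying false invoice issuance based on dynamic network representation.
[Background Art]
False invoice issuance refers to enterprises using various means to issue invoices that do not match their actual business activity, in order to evade taxes.
False invoicing causes huge losses of national tax revenue and seriously disrupts the *** economic order. At present, tax bureaus identify enterprises suspected of false invoicing mainly through tip-offs, routine supervisory spot checks, and implication by other problem enterprises, after which tax inspectors check the statements provided by the enterprise. Such inspections are highly incidental and cannot systematically analyze and evaluate all enterprises. Moreover, manual checking by tax inspectors alone is laborious and inefficient, and the data examined is limited to the statements provided by a single enterprise, so related upstream and downstream enterprises cannot be taken into account.
To address the problems that false-invoice identification currently faces, network representation technology offers a solution. A network-representation-based identification method can organize isolated statement information into an enterprise transaction network, so that all enterprises are checked systematically, and can also use the links between enterprises to obtain more enterprise information for identifying false-invoicing enterprises. The following patents provide related methods that use network representation technology to identify false invoicing automatically by computer:
Document 1. A method for detecting falsely issued special VAT invoices based on parallel loop detection (201710147850.8);
Document 2. A method for identifying suspicious taxpayers based on the taxpayer interest-associated network (201410328391.X);
Document 1 organizes invoice information into a static network with enterprises as nodes and improves loop detection in that network: a distributed parallel computing method assigns computing tasks to multiple computers in a distributed cluster to improve efficiency, and the improved loop detection is then used to detect falsely issued special VAT invoices.
Document 2 identifies suspicious taxpayers from the topological features of the taxpayer interest-associated network (TPIN): it analyzes the topological features of the network to obtain each taxpayer's representation in it, and then runs experiments with a C4.5 classifier, thereby identifying suspicious taxpayers automatically.
The methods in the above documents have the following main problems. Document 1 can only detect false invoicing in which funds return to the source account after passing through multiple accounts; false invoicing takes many forms and is not limited to loops, so the method identifies too narrow a range of types and generalizes poorly. Document 2 relies only on the topology of taxpayers and interest relationships and ignores enterprise attribute information; it treats all enterprises alike and cannot analyze them in terms of scale, market share, and so on. Both documents are limited to static networks: they cannot combine historical information to analyze changes in enterprise transactions dynamically or track those changes accurately, which leaves loopholes for some enterprises. For example, the yearly accounts of a tax-evading enterprise may look unproblematic when viewed one at a time, yet the enterprise reports losses for several consecutive years while its water and electricity costs rise year after year. False invoicing is often hidden in such time-series features, which a static network cannot capture.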
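[Summary of the Invention]
To improve the efficiency of false-invoice identification, the purpose of the invention is to provide a method and system for identifying false invoice issuance based on dynamic network representation. The invention adopts dynamic network representation, analyzes the enterprise transaction network dynamically in combination with historical information, and accurately tracks the dynamic changes in enterprise transactions; based on the association information between enterprises, it can identify different kinds of false invoicing; and, drawing on distributed optimization algorithms, it decomposes the computation function into independent sub-functions executed in parallel, which improves the efficiency of identification.
To achieve the above purpose, the invention adopts the following technical solution:
A method for identifying false invoice issuance based on dynamic network representation: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is built with each day as a time node, a time-series window of 30 days is established, 30 days of static network representations are fused within the window each time, and the static network representations of all time nodes are fused step by step by moving the window to obtain the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is built on LightGBM to identify enterprises suspected of false invoicing.
Further improvements of the invention are as follows:
The method specifically includes the following implementation steps:
Step 1: basic feature extraction
The data is first preprocessed, and the basic enterprise information is then extracted. Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
Step 2: feature extraction based on dynamic network representation
After the basic enterprise features are extracted, enterprise transaction information is organized into a static network with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node. A time-series window is then established in units of 30 days; 30 days of static network representations are fused within the window each time, the static network representations of all times are fused step by step by moving the window, the objective function of the network representation is optimized, and the optimal dynamic enterprise transaction network representation is finally obtained;
Step 3: distributed algorithm optimization
To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions; optimizing the sub-functions in parallel speeds up solving the representation of a large-scale, complex enterprise transaction network;
Step 4: building a classifier to identify false invoicing
A binary classification model is built on the LightGBM classifier. The computed dynamic network representation serves as the classifier's learning data, the model is trained with the labeled enterprise sample set, the representations of the enterprise samples to be predicted are then fed into the trained model, and whether the target enterprise has engaged in false invoicing is finally determined from the output of the prediction model.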
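Step 1 is implemented as follows:
Step 101: data preprocessing
(1) Extract the "taxpayer electronic file number" as the unique identifier of the enterprise's features;
(2) Handle missing values: attributes with severe missing data and attributes irrelevant to the false-invoicing task are deleted outright; important attributes with a small number of missing values are completed by same-class mean imputation;
Step 102: processing text data
Processing the text information in the basic enterprise information table includes:
(1) Segment the enterprise's text data with the Jieba word segmentation tool;
(2) Count the segmentation results with a dictionary tree and select the words with larger weights as keywords;
(3) Convert the N classes of extracted keywords into vectors with word2vec;
Step 103: processing flag-type data
Discrete categorical data in the basic enterprise information table is One-Hot encoded: status bits whose length equals the number of values the attribute can take mark each specific state;
Step 104: processing numerical data
Numerical data in the basic enterprise information table is processed with the conventional standardization method:
(1) Compute the mean of each attribute;
(2) Compute the variance of each attribute;
(3) Apply Z-Score standardization.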
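Step 2 is implemented as follows:
Step 201: establishing the static enterprise transaction network
A representation model of the enterprise transaction network is built for every day, so that enterprises with similar topology or higher transaction weights lie closer together in the representation space. The objective function is:
Figure PCTCN2020113450-appb-000001
where $h_i$ and $h_j$ are the representations of enterprises $i$ and $j$, and $w_{ij}$ is the weight of the transactions between them; minimizing $w_{ij}\|h_i-h_j\|^2$ forces the representations of enterprises $i$ and $j$ closer together the larger the transaction weight $w_{ij}$ is;
Minimizing the objective
Figure PCTCN2020113450-appb-000002
yields the optimized enterprise transaction network representation $h$ for that day;
Step 202: dynamically fusing historical information
A time-series window of 30 days is established; 30 days of static network representations are fused within the window each time, the window is then moved, and all static network representations are fused step by step, finally giving the dynamic enterprise transaction network representation. The corresponding optimization objective is:
Figure PCTCN2020113450-appb-000003
where
Figure PCTCN2020113450-appb-000004
denote, respectively, the representations of enterprises $p$ and $q$ on day $t$ and the weight of the transactions between them,
Figure PCTCN2020113450-appb-000005
denotes how close the representations of enterprise $p$ and enterprise $q$ are; $H_i$ denotes the network representation on day $i$ within the time-series window; the penalty term
Figure PCTCN2020113450-appb-000006
keeps the learned representation matrix as close as possible to the matrix of the original enterprise transaction network. $\rho$ is a parameter that defines the structural properties of the model and how much the approximation of the original matrix contributes: the larger $\rho$ is, the more the model emphasizes the time-series network representation, and the smaller it is, the more it emphasizes node representation;
Minimizing the objective
Figure PCTCN2020113450-appb-000007
yields the optimized dynamic enterprise transaction network representation $H$.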
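Step 3 is implemented as follows:
Step 301: decomposing the objective function
The optimization function (2) is restructured and written in decomposable form:
Figure PCTCN2020113450-appb-000008
where
Figure PCTCN2020113450-appb-000009
denote, respectively, the representations of enterprises $p$ and $q$ on day $t$ and the weight of the transactions between them,
Figure PCTCN2020113450-appb-000010
denotes how close the representations of enterprise $p$ and enterprise $q$ are; the penalty term
Figure PCTCN2020113450-appb-000011
builds on the approximation of the original enterprise transaction network matrix in equation (2) and splits the data by individual enterprise for computation;
Minimizing the objective
Figure PCTCN2020113450-appb-000012
yields the optimized dynamic enterprise transaction network representation $H$;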
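Step 302: executing multiple sub-functions in parallel
Equation (3) is decomposed into $N$ sub-optimization functions, where $N$ is the number of network nodes, i.e. the number of enterprises in the enterprise transaction network; they are solved in parallel to obtain $H_t^{k+1}$
Figure PCTCN2020113450-appb-000013
where
Figure PCTCN2020113450-appb-000014
denotes the enterprises associated with enterprise $v$,
Figure PCTCN2020113450-appb-000015
denotes the representation of enterprise $v$ on day $t$,
Figure PCTCN2020113450-appb-000016
denotes the representation of enterprise $v$ on day $t$ after $k$ iterations,
Figure PCTCN2020113450-appb-000017
denotes the weight of the transactions between enterprises $v$ and $q$ on day $t$,
Figure PCTCN2020113450-appb-000018
denotes how close the representations of enterprises $v$ and $q$ are on day $t$ after $(k-1)$ iterations;
Figure PCTCN2020113450-appb-000019
denotes how close the representations of enterprise $v$ on day $i$ and on day $t$ are;
where
Figure PCTCN2020113450-appb-000020
is the representation of enterprise $v$ on day $t$ to be solved for. An iterative optimization method checks whether the result has reached the required precision: the problem is solved by gradient descent, and when the convergence condition
Figure PCTCN2020113450-appb-000021
or
Figure PCTCN2020113450-appb-000022
is met, the optimization function attains its optimum. That is, updating stops when the results of an enterprise's $k$-th and $(k-1)$-th iterations agree to the required precision, or when an enterprise's iterate is close enough to those of its associated enterprises; the representation from the $k$-th iteration is then that enterprise's representation for that day;
Step 303: consolidating the parallel results
Computing the $N$ nodes of the transaction network in parallel gives each enterprise's representation on day $t$; for a dynamic transaction network distributed over time nodes 1 to $T$, the network representation at each time node is then computed in order.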
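Step 4 is implemented as follows:
Step 401: combine the basic features obtained in step 1 and the dynamic network features obtained in step 3 as the learning data of the classifier;
Step 402: build a binary classification model on LightGBM and set the main classifier parameters as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;
Step 403: take the representations obtained for the sample set of enterprises labeled as false invoicers and for the sample set of normal enterprises as basic features, and split them randomly at a ratio of 3:1 into a training set and a test set; then randomly set aside 10% of the training set as a validation set. Train the classification model of step 402 on the training set and use the validation set to tune training; if over-fitting occurs, prune. Select the best model and verify the accuracy of the algorithm on the test set;
Step 404: input the representations of the unlabeled enterprise samples into the LightGBM-based model for predicting enterprises suspected of false invoicing, and determine from the model's output whether the target enterprise has engaged in false invoicing.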
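The data split described in step 403 (3:1 into training and test sets, then 10% of the training set held out for validation) can be sketched as follows. This is a minimal illustration in plain Python: the function name `split_samples`, the fixed seed, and the use of integer division for the group sizes are assumptions, since the source only states the ratios.

```python
import random


def split_samples(samples, seed=0):
    """Split labeled samples into (train, validation, test).

    Assumed interpretation of the source: 3:1 train/test, then 10% of the
    training portion is set aside as a validation set.
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)

    n_test = len(shuffled) // 4        # 3:1 ratio -> one quarter is test data
    test = shuffled[:n_test]
    train_full = shuffled[n_test:]

    n_val = len(train_full) // 10      # 10% of the training set
    validation = train_full[:n_val]
    train = train_full[n_val:]
    return train, validation, test
```

With 400 labeled enterprises this yields 270 training, 30 validation and 100 test samples.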
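Compared with the prior art, the invention has the following beneficial effects:
The invention is a method for identifying enterprises suspected of false invoicing based on the idea of dynamic network representation learning, and has the following advantages:
1. It adopts dynamic network representation and, in combination with historical information, learns and fuses representation vectors for the networks of all time nodes, so it can accurately track the dynamic changes of the enterprise transaction network and improve the accuracy of false-invoice identification;
2. Based on the association information between enterprises, it can identify different types of false invoicing;
3. Drawing on distributed optimization algorithms, it decomposes the computation function into independent sub-functions executed in parallel, which reduces the time complexity of computing the network representation and improves the efficiency of false-invoice identification.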
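[Brief Description of the Drawings]
Figure 1 is the flow chart of the overall framework;
Figure 2 is a schematic diagram of the basic feature extraction process;
Figure 3 is a schematic diagram of the feature extraction process based on dynamic network representation;
Figure 4 is a schematic diagram of the optimization process of the network representation algorithm;
Figure 5 is a schematic diagram of the process of building a classifier to identify false invoicing;
Figure 6 is a schematic diagram of a system for identifying false invoice issuance based on dynamic network representation according to an embodiment of the invention.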
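[Detailed Description of the Embodiments]
To help those skilled in the art better understand the solution of the invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention and are not intended to limit the scope of the disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts disclosed. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the scope of protection of the invention.
The invention is described in further detail below with reference to the drawings:
Referring to Figure 1, the method for identifying false invoice issuance based on dynamic network representation includes the following steps:
S101. Basic feature extraction
After the data is preprocessed, the basic enterprise information is extracted. Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized.
As shown in Figure 2, basic feature extraction specifically includes the following steps:
S201. Data preprocessing
Step 1: Extract the "taxpayer electronic file number" as the unique identifier of the enterprise's features, and delete all other attributes that cannot describe the enterprise's own distribution pattern;
Step 2: When an attribute has a large number of missing values and only very few valid ones (for example, fewer than 10% of enterprises have values for "taxpayer tax authority code", "financial statement type" and "accounting form"), the feature is deleted outright. When an attribute has a small number of missing values (for example, "number of employees" and "registered capital" are missing for a few enterprises), the missing values are completed by same-class mean imputation.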
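Same-class mean imputation, as named in S201, can be sketched as below. The source does not say which attribute defines the "class"; grouping by a key such as enterprise type is an assumption made here for illustration, and the function and field names are hypothetical.

```python
from collections import defaultdict


def impute_class_mean(rows, value_key, class_key):
    """Return copies of rows where missing (None) values of value_key are
    replaced by the mean of the rows that share the same class_key value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value_key] is not None:
            sums[row[class_key]] += row[value_key]
            counts[row[class_key]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)            # leave the input rows untouched
        cls = new_row[class_key]
        if new_row[value_key] is None and counts[cls] > 0:
            new_row[value_key] = sums[cls] / counts[cls]
        filled.append(new_row)
    return filled
```

A class with no observed value at all is left unfilled, which mirrors the source's choice to drop attributes that are almost entirely missing rather than impute them.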
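S202. Processing text data
The text data "goods information" and "business scope" in the basic enterprise information table are preprocessed and features are extracted from them. The specific steps are:
Step 1: Segment the text with the Jieba word segmentation tool, build a suitable stop-word list, and remove the stop words from the text. For example, the "business scope" field of one enterprise in this embodiment reads "production, sales: ceramics and products; import and export of goods, import and export of technology". After segmentation and stop-word removal, the result is "production, sales, ceramics and products, import and export of goods, import and export of technology";
Step 2: Count the results of step 1 with a dictionary tree and select the words with larger weights as keywords;
Step 3: Convert the N classes of keywords extracted in step 2 into vectors with word2vec.
S203. Processing categorical data
One-Hot encoding is used for the discrete categorical data "enterprise type" and "enterprise status" in the basic enterprise information table. The number of possible values of the attribute gives the length of the status bits; one bit is set to 1 and all the others to 0 to indicate a specific state. For example, the "enterprise type" field in this embodiment has four possible values: "sole proprietorship", "partnership", "limited liability company" and "joint-stock limited company". The status-bit length of "enterprise type" is therefore 4, where 1000 means "sole proprietorship", 0100 means "partnership", 0010 means "limited liability company", and 0001 means "joint-stock limited company".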
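The One-Hot scheme of S203 can be illustrated with the four "enterprise type" values from the embodiment. The English category labels and the helper name `one_hot` are illustrative choices, not identifiers from the source.

```python
# Category order fixes which status bit each value occupies.
ENTERPRISE_TYPES = [
    "sole proprietorship",         # 1000
    "partnership",                 # 0100
    "limited liability company",   # 0010
    "joint-stock limited company", # 0001
]


def one_hot(value, categories):
    """Encode value as a list of status bits, one bit per category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]
```

For instance, `one_hot("partnership", ENTERPRISE_TYPES)` gives `[0, 1, 0, 0]`, matching the code 0100 in the text.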
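S204. Processing numerical data
The numerical data "registered capital", "total investment" and "number of employees" in the basic enterprise information table are standardized; this embodiment uses "registered capital" as the example:
Step 1: Obtain the mean of the "registered capital" attribute
Let $u$ be the mean of the "registered capital" attribute, computed as:
Figure PCTCN2020113450-appb-000023
where $n$ is the number of basic enterprise information samples and $x_j$ is the $j$-th value of the "registered capital" attribute;
Step 2: Obtain the variance of each attribute
Let $\sigma^2$ be the variance of the "registered capital" attribute, computed as:
Figure PCTCN2020113450-appb-000024
The mean and variance are the basic statistics of a numerical attribute, and together they allow the attribute to be standardized;
Step 3: Z-Score standardization
Let $\delta$ denote the standardized "registered capital" values, where $\delta = (\delta_1, \delta_2, \ldots, \delta_n)$ and $\delta_j$ is the standardized value of the $j$-th "registered capital" entry, computed as:
$\delta_j = (x_j - u)/\sigma, \quad j = 1, 2, \ldots, n$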
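The three steps of S204 can be sketched in a few lines. Because the mean and variance formulas appear only as images in the source, dividing the variance by $n$ (the population form) is an assumption; the function name `z_score` is likewise illustrative.

```python
import math


def z_score(values):
    """Standardize values as (x_j - u) / sigma, following S204."""
    n = len(values)
    u = sum(values) / n                              # Step 1: mean
    variance = sum((x - u) ** 2 for x in values) / n # Step 2: variance (assumed 1/n form)
    sigma = math.sqrt(variance)
    if sigma == 0:
        # Constant attribute: every standardized value is defined as 0 here.
        return [0.0] * n
    return [(x - u) / sigma for x in values]         # Step 3: Z-Score
```

After standardization the values have mean 0 and unit variance, so attributes on very different scales, such as registered capital and number of employees, become comparable.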
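S102. Feature extraction based on dynamic network representation
First, a static enterprise transaction network is established with enterprises as nodes, transaction records as edges, and each day as a time node. A time-series window is then established in units of 30 days; 30 days of static network representations are fused within the window each time, the static network representations of all times are fused step by step by moving the window, and the objective function of the network representation is optimized to obtain the optimal dynamic enterprise transaction network representation.
As shown in Figure 3, feature extraction based on dynamic network representation specifically includes the following steps:
Step 1: Establish the static enterprise transaction network
A representation model of the enterprise transaction network is built for each day; the objective function is:
Figure PCTCN2020113450-appb-000025
Minimizing the objective
Figure PCTCN2020113450-appb-000026
gives the representation $h$ of each enterprise on that day, such that enterprises with similar transaction structure or large transaction weights lie closer together in the representation space, and hence the representation of the whole enterprise transaction network for that day.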
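The per-pair term the text describes, a transaction weight multiplied by the squared distance between two enterprise representations, can be evaluated as in the sketch below. Formula (1) itself is only an image in the source, so summing this term over all weighted pairs is an assumption about its exact form, and any constraints it may carry are not modeled. The names `static_objective`, `h` and `w` are illustrative.

```python
def static_objective(h, w):
    """Sum of w_ij * ||h_i - h_j||^2 over all weighted enterprise pairs.

    h: dict mapping enterprise id -> representation vector (list of floats)
    w: dict mapping (i, j) pairs -> transaction weight w_ij
    """
    total = 0.0
    for (i, j), w_ij in w.items():
        squared_distance = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += w_ij * squared_distance
    return total
```

A heavily weighted pair contributes more to the total for the same distance, which is why minimizing it pulls strongly trading enterprises together in the representation space.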
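Step 2: Dynamically fuse historical information
All static enterprise transaction network representations are fused step by step within the time-series window, finally giving the dynamic enterprise transaction network representation. The optimization objective is:
Figure PCTCN2020113450-appb-000027
The time-series window is 30 days long. Within the window, 30 days of static network representations are fused each time; the window is then moved so that all static network representations are fused step by step. Minimizing the objective
Figure PCTCN2020113450-appb-000028
gives the representation $H$ of each enterprise on that day. In this embodiment, $\rho = 0.75$ was found to work best, balancing attention between the time-series network representation and the node representation;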
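How a 30-day window moves across the daily snapshots can be sketched as follows. The source fixes the window length at 30 days but does not state the stride, so the step of one day is an assumption, as is the function name.

```python
def sliding_windows(num_days, window=30, step=1):
    """Return (first_day, last_day) pairs, 1-indexed and inclusive, for every
    full window that fits inside days 1..num_days."""
    windows = []
    start = 1
    while start + window - 1 <= num_days:
        windows.append((start, start + window - 1))
        start += step                  # assumed stride; not given in the source
    return windows
```

For 32 days of data this produces the windows (1, 30), (2, 31) and (3, 32); each window would then be the set of static representations fused in one pass.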
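S103. Distributed algorithm optimization
First the objective function is decomposed; then multiple sub-functions are executed in parallel; finally the parallel results are consolidated.
As shown in Figure 4, distributed algorithm optimization specifically includes the following steps:
S401. Decomposing the objective function
The optimization function (2) is restructured and written in decomposable form:
Figure PCTCN2020113450-appb-000029
In this embodiment the enterprise transaction network involves 3765 enterprises in total, so $N = 3765$, and $v$ runs from 1 to 3765 to compute each enterprise and its associated transaction network; $\rho = 0.75$ is chosen to balance attention between the time-series network representation and the node representation;
S402. Executing multiple sub-functions in parallel
Equation (3) is decomposed by enterprise $v$ into 3765 sub-optimization functions, which are solved in parallel and finally merged to obtain $H_t^{k+1}$; the objective function of a single sub-problem is:
Figure PCTCN2020113450-appb-000030
In this embodiment, $\rho = 0.75$ is chosen to balance attention between the time-series network representation and the node representation. Computing in order gives the result of each sub-function:
Figure PCTCN2020113450-appb-000031
is the representation of each enterprise on day $t$ after the $k$-th iteration, obtained by solving the individual sub-functions, from which
Figure PCTCN2020113450-appb-000032
the representation of the dynamic enterprise transaction network on day $t$ after the $k$-th iteration, is obtained;
S403. Consolidating the parallel results
Equation (4) is solved by gradient descent. In this embodiment, updating stops when
Figure PCTCN2020113450-appb-000033
or
Figure PCTCN2020113450-appb-000034
holds; the representation at which the values are approximately equal is the representation of the enterprise transaction network for that day. For a dynamic transaction network distributed over days 1 to $T$, the network representation of each day can thus be computed in order.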
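S104. Building a classifier to identify false invoicing
First, the basic features obtained in S101 and the dynamic network features obtained in S102 are combined as the learning data of the classifier; second, a binary classification model is built on the LightGBM classifier; then the model is trained with the sample set of enterprises labeled as to whether they issued false invoices; finally, the representations of the enterprise samples to be predicted are fed into the trained model, and whether the target enterprise has engaged in false invoicing is determined from the output of the prediction model.
As shown in Figure 5, building the classifier to identify false invoicing specifically includes the following steps:
S501. Obtaining the learning data of the classifier
The basic features obtained in S101 and the dynamic network features obtained in S103 are combined as the learning data of the classifier. In this embodiment the basic enterprise feature vector obtained in S101 is placed directly after the dynamic network feature vector obtained in S103, and the combined vector serves as the learning data of the classifier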
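The combination in S501 is a plain concatenation, with the dynamic network vector first and the basic feature vector appended after it. A minimal sketch, with an illustrative function name and made-up example values:

```python
def build_learning_vector(dynamic_network_vec, basic_feature_vec):
    """Place the basic feature vector directly after the dynamic network
    feature vector, as described for this embodiment."""
    return list(dynamic_network_vec) + list(basic_feature_vec)
```

For example, a two-dimensional network representation `[0.1, 0.2]` followed by a One-Hot enterprise type `[1, 0, 0, 0]` yields the six-dimensional row `[0.1, 0.2, 1, 0, 0, 0]`.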
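S502. Building a binary classification model on LightGBM
The main classifier parameters are set as follows: the number of leaves is 13, the learning rate is 0.1, and the number of iterations is 100;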
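The three stated settings could be written as a LightGBM-style parameter dictionary, as sketched below. The key names are LightGBM's documented core parameter names, but mapping the patent's wording onto them (in particular "number of iterations" onto `num_iterations`) is an assumption, and `LGBM_PARAMS` is a hypothetical name. All other parameters are left at whatever defaults the library applies.

```python
# Hypothetical configuration: values come from the text, key mapping is assumed.
LGBM_PARAMS = {
    "objective": "binary",   # two classes: suspected false invoicer vs. normal
    "num_leaves": 13,        # "the number of leaves is 13"
    "learning_rate": 0.1,    # "the learning rate is 0.1"
    "num_iterations": 100,   # "the number of iterations is 100"
}
```

A small leaf count such as 13 keeps each tree shallow, which is one common way to limit over-fitting on a modest sample of labeled enterprises.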
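S503. Training the model
Step 1: Take the representations obtained for the sample set of enterprises labeled as false invoicers and for the sample set of normal enterprises as basic features, and split them randomly at a ratio of 3:1 into a training set and a test set.
Step 2: Randomly set aside 10% of the training set as a validation set.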
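Step 3: Train the classification model built in S502 on the training set and use the validation set to tune training; prune when over-fitting occurs;
Step 4: Iterate. Since the number of iterations is set to 100, iteration is forcibly stopped if the convergence condition has not been reached after 100 iterations, and the result of the last iteration is taken as the computed representation.
Step 5: Select the best model and verify the accuracy of the algorithm on the test set. In this embodiment the verified accuracy is 0.957, the precision is 0.921, and the recall is 0.87, indicating that the model performs very well on the test set and can meet the requirements for identifying false invoicing in real tax scenarios. Compared with other false-invoice identification methods based on static network representation, which reach an accuracy of 0.876, a precision of 0.856, and a recall of 0.794, the method of the invention improves accuracy by 9.25%, precision by 7.6%, and recall by 9.57%. Besides higher accuracy, the improvement also shows in the efficiency gained from distributed parallel computation: on the data sample of this embodiment the distributed algorithm runs in 684.57 s, which is 28.56% shorter than the 958.19 s of the non-distributed algorithm.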
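The reported improvements are relative changes against the static-network baseline, not percentage-point differences. The short check below reproduces them from the figures quoted in the text; the helper name `relative_change` is illustrative.

```python
def relative_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


accuracy_gain = relative_change(0.957, 0.876)    # accuracy vs. static baseline
precision_gain = relative_change(0.921, 0.856)   # precision vs. static baseline
recall_gain = relative_change(0.87, 0.794)       # recall vs. static baseline
runtime_change = relative_change(684.57, 958.19) # negative: runtime got shorter
```

Rounded, these give 9.25%, 7.6%, 9.57% and a 28.56% reduction, matching the stated values. In absolute terms the accuracy gain is 8.1 percentage points (0.957 against 0.876).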
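S504. Predicting enterprises suspected of false invoicing
The representations of the unlabeled enterprise samples are input into the trained model for predicting enterprises suspected of false invoicing, and whether the target enterprise has engaged in false invoicing is determined from the model's output. In this embodiment the predicted values are sorted from high to low and the top ten percent are taken as enterprises suspected of false invoicing.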
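The selection rule in S504, sorting predicted values from high to low and keeping the top ten percent, can be sketched as follows. Rounding the cut-off up when the sample count is not a multiple of ten is an assumption, since the source does not say how ties or fractional counts are handled; the function name is illustrative.

```python
def top_percent(scores, percent=10):
    """Return the enterprise ids whose predicted values fall in the top
    `percent` percent, highest first.

    scores: dict mapping enterprise id -> predicted value
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Integer ceiling of len * percent / 100 (assumed rounding rule).
    cutoff = (len(ranked) * percent + 99) // 100
    return [enterprise_id for enterprise_id, _ in ranked[:cutoff]]
```

With 20 scored enterprises the two highest-scoring ones are returned.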
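In another embodiment of the invention, a system for identifying false invoice issuance based on dynamic network representation is provided, the system comprising:
an enterprise attribute feature extraction module, which preprocesses the data and then extracts the basic enterprise information. Basic enterprise information falls roughly into three types: text data is converted into vectors with the word2vec algorithm, categorical data is One-Hot encoded, and numerical data is standardized;
a dynamic network representation building module, which processes the enterprise attribute features, with each day as a time node, to obtain the enterprise's static transaction network representation, then establishes a 30-day time-series window, fuses the static network representations within the window through a regularization term, and slides the window along the time series to fuse all static network representations step by step into the dynamic network representation;
a parallel optimization module for the dynamic network representation, which decomposes the objective of the enterprise dynamic network representation into independent sub-objectives; optimizing the sub-objectives in parallel makes the dynamic network representation more efficient and yields the final representation more quickly;
a false-invoice identification module, which takes the resulting enterprise dynamic network representation as the features of false-invoicing behavior, inputs it into the binary classifier built on LightGBM, trains the identification model with the labeled enterprise sample set, and inputs the representations of the enterprise samples to be predicted into the trained model to obtain the enterprises suspected of false invoicing.
The above merely illustrates the technical idea of the invention and does not limit its scope of protection; any change made on the basis of the technical solution in accordance with the technical idea proposed by the invention falls within the scope of protection of the claims of the invention.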

Claims (6)

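  1. A method for identifying false invoicing based on dynamic network representation, characterized in that: first, enterprise transaction information is organized into a static network with enterprises as nodes and transaction records as edges; second, a representation of the enterprise transaction network is established with each day as a time node, a time window 30 days long is established, the static network representations of 30 days are fused within the window each time, and the window is moved to fuse the static network representations of all time nodes step by step into the final dynamic network representation; third, drawing on distributed optimization algorithms, the objective function of the representation is decomposed into independent sub-functions, which are optimized in parallel to improve the learning efficiency of the model; finally, a binary classifier is built on LightGBM to identify enterprises suspected of false invoicing.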
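  2. The method for identifying false invoicing based on dynamic network representation according to claim 1, characterized in that the method comprises the following steps:
    Step 1, basic feature extraction
    The data are first preprocessed and basic enterprise information is then extracted; the basic enterprise information falls broadly into three types: text data are converted to vectors with the word2vec algorithm, categorical data are One-Hot encoded, and numerical data are standardized;
    Step 2, feature extraction based on dynamic network representation
    After the basic enterprise features have been extracted, enterprise transaction information is organized into static networks, with enterprises as nodes, basic enterprise information as node attributes, transaction records as edges, transaction information as edge attributes, and each day as a time node; a time window is then established in units of 30 days, the static network representations of 30 days are fused within the window each time, the window is moved to fuse the static network representations of all times step by step, the objective function of the network representation is optimized, and the optimal dynamic enterprise transaction network representation is finally obtained;
    Step 3, distributed algorithm optimization
    To improve the learning efficiency of the dynamic network representation, and drawing on distributed optimization algorithms, the objective function of the dynamic enterprise transaction network representation is decomposed into independent sub-functions, and optimizing the sub-functions in parallel accelerates the solution of large-scale, complex enterprise transaction network representations;
    Step 4, building a classifier to identify false invoicing
    A binary classification model is built on the LightGBM classifier, the computed dynamic network representation is used as the classifier's learning data, the model is trained with the labeled enterprise sample set, the representations of the enterprise samples to be predicted are then fed to the trained model, and whether a target enterprise has engaged in false invoicing is finally determined from the model's output.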
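  3. The method for identifying false invoicing based on dynamic network representation according to claim 2, characterized in that step 1 is implemented as follows:
    Step 101, data preprocessing
    (1) extract the "taxpayer electronic file number" as the unique identifier of the enterprise features;
    (2) handle missing values: attributes with severe missing data and attributes irrelevant to the false-invoicing task are deleted outright, and important attributes with a small number of missing values are completed by same-class mean imputation;
    Step 102, processing text data
    Processing of the text information in the basic enterprise information table comprises:
    (1) segmenting the enterprise's text data into words with the Jieba word segmentation tool;
    (2) counting the segmentation results with a dictionary tree (trie) and selecting the words with larger weights as keywords;
    (3) converting the extracted N classes of keywords into vectors with word2vec;
    Step 103, processing flag-type data
    Discrete categorical data in the basic enterprise information table are One-Hot encoded; status bits, equal in number to the values the attribute can take, are established to mark each specific state;
    Step 104, processing numerical data
    Numerical data in the basic enterprise information table are processed with the conventional standardization method:
    (1) compute the mean of each attribute;
    (2) compute the variance of each attribute;
    (3) apply Z-Score standardization.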
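Steps 103 and 104 can be illustrated with a short standard-library sketch. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the claim only names the mean, the variance, and Z-Score standardization.

```python
import statistics


def one_hot(value, categories):
    """One status bit per possible attribute value; exactly one bit is set."""
    return [1 if value == c else 0 for c in categories]


def z_score(values):
    """Standardize a numerical attribute to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]
```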
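  4. The method for identifying false invoicing based on dynamic network representation according to claim 3, characterized in that step 2 is implemented as follows:
    Step 201: building the static enterprise transaction network
    A representation model of the enterprise transaction network is built for each day, such that enterprises with similar topological structure or higher transaction weight lie closer together in the representation space; the objective function to be optimized is:
    Figure PCTCN2020113450-appb-100001
    where h_i and h_j are the representations of enterprises i and j, and w_ij is the weight of the transactions between them; minimizing w_ij ||h_i - h_j||^2 forces the representations of enterprises i and j to be closer the larger the transaction weight w_ij is;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100002
    yields the optimized enterprise transaction network representation h for that day;
    Step 202: dynamically fusing historical information
    A time window 30 days long is established, the static network representations of 30 days are fused within the window each time, and the window is then moved to fuse all static network representations step by step, finally giving the dynamic enterprise transaction network representation; the corresponding optimization objective is:
    Figure PCTCN2020113450-appb-100003
    where
    Figure PCTCN2020113450-appb-100004
    denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them,
    Figure PCTCN2020113450-appb-100005
    denotes how close the representations of enterprise p and enterprise q are; H_i denotes the network representation on day i within the time window; the penalty term
    Figure PCTCN2020113450-appb-100006
    makes the learned representation matrix approximate the matrix of the original enterprise transaction network as closely as possible; ρ is a parameter that sets the relative contribution of the model's structural properties and of the closeness to the original matrix: the larger ρ is, the more the model emphasizes the temporal network representation, and the smaller it is, the more it emphasizes the node representation;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100007
    yields the optimized dynamic enterprise transaction network representation H.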
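The two ingredients of claim 4 can be sketched as follows. The static term is written as a sum of w_ij ||h_i - h_j||^2 over transaction edges, which is an assumption consistent with the description of step 201 (the equation itself is only available as an image). The window convention, a window ending on each day, covering at most 30 days and advancing one day at a time, is also an assumption, since the claim does not state the stride. The function names are hypothetical.

```python
def static_objective(h, w):
    """Sum of w_ij * ||h_i - h_j||^2 over the edges of one day's network."""
    total = 0.0
    for (i, j), w_ij in w.items():
        sq_dist = sum((a - b) ** 2 for a, b in zip(h[i], h[j]))
        total += w_ij * sq_dist
    return total


def sliding_windows(num_days, length=30):
    """Yield, for each day t = 1..num_days, the days fused in the window ending at t."""
    for end in range(1, num_days + 1):
        start = max(1, end - length + 1)
        yield range(start, end + 1)
```

A heavier edge contributes more to the objective for the same distance, which is why minimizing it pulls strongly connected enterprises together.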
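  5. The method for identifying false invoicing based on dynamic network representation according to claim 4, characterized in that step 3 is implemented as follows:
    Step 301, decomposing the objective function
    The optimization function (2) is restructured and written in a decomposable form:
    Figure PCTCN2020113450-appb-100008
    where
    Figure PCTCN2020113450-appb-100009
    denote, respectively, the representations of enterprises p and q on day t and the weight of the transactions between them,
    Figure PCTCN2020113450-appb-100010
    denotes how close the representations of enterprise p and enterprise q are; the penalty term
    Figure PCTCN2020113450-appb-100011
    builds on the approximation of the original enterprise transaction network matrix in equation (2) and splits the data so that each enterprise is computed individually;
    Minimizing the objective
    Figure PCTCN2020113450-appb-100012
    yields the optimized dynamic enterprise transaction network representation H;
    Step 302, executing the sub-functions in parallel
    Equation (3) is decomposed into N sub-optimization functions, where N is the number of network nodes, that is, the number of enterprises in the enterprise transaction network, and these are solved in parallel to obtain
    Figure PCTCN2020113450-appb-100013
    Figure PCTCN2020113450-appb-100014
    where
    Figure PCTCN2020113450-appb-100015
    denotes the enterprises associated with enterprise v,
    Figure PCTCN2020113450-appb-100016
    denotes the representation of enterprise v on day t,
    Figure PCTCN2020113450-appb-100017
    denotes the representation of enterprise v on day t after k iterations,
    Figure PCTCN2020113450-appb-100018
    denotes the weight of the transactions between enterprises v and q on day t,
    Figure PCTCN2020113450-appb-100019
    denotes how close the representations of enterprise v and enterprise q are on day t after (k-1) iterations;
    Figure PCTCN2020113450-appb-100020
    denotes how close the representations of enterprise v on day i and on day t are;
    where
    Figure PCTCN2020113450-appb-100021
    is the representation of enterprise v on day t that is to be solved for. An iterative optimization method is used to judge whether the result has reached the required precision: the problem is solved by gradient descent, and when the convergence condition
    Figure PCTCN2020113450-appb-100022
    or
    Figure PCTCN2020113450-appb-100023
    is met, the optimization function attains its optimum. Updating stops when the results of an enterprise's k-th and (k-1)-th iterations agree to the required precision, or when an enterprise's iteration result is sufficiently close to those of its associated enterprises; the representation from the k-th iteration is then the representation of that enterprise for that day;
    Step 303, consolidating the parallel results
    Computing the N nodes of the transaction network in parallel gives the representation of each enterprise on day t; then, for the dynamic transaction network spread over time nodes 1 to T, the representation of the network at each time node is obtained by computing the time nodes in order.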
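  6. The method for identifying false invoicing based on dynamic network representation according to claim 5, characterized in that step 4 is implemented as follows:
    Step 401, the basic features obtained in step 1 and the dynamic network features obtained in step 3 are combined as the learning data for the classifier;
    Step 402, a binary classification model is built on LightGBM, with the main classifier parameters set as follows: number of leaves 13, learning rate 0.1, number of iterations 100;
    Step 403, the representations obtained for the enterprise samples labeled as false invoicers and for the normal enterprise samples are used as the base features and randomly divided in a 3:1 ratio into a training set and a test set, and ten percent of the training set is further randomly set aside as a validation set; the classification model of step 402 is trained on the training set and tuned on the validation set, and pruning is applied if overfitting occurs; the best model is selected and the accuracy of the algorithm is verified on the test set;
    Step 404, the representations of the unlabeled enterprise samples are input to the LightGBM-based prediction model for suspected false-invoicing enterprises, and whether a target enterprise has engaged in false invoicing is finally determined from the model's output.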
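The data split described in step 403 (3:1 training to test, then ten percent of the training portion held out for validation) can be sketched with the standard library. The function name `split_samples` and the fixed seed are hypothetical, and the split here is a plain random shuffle; whether the original split was stratified by label is not stated.

```python
import random


def split_samples(samples, seed=0):
    """Shuffle, hold out one quarter for testing, then 10% of the rest for validation."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 4            # 3:1 ratio, so one quarter is the test set
    test = shuffled[:n_test]
    train_full = shuffled[n_test:]
    n_val = len(train_full) // 10          # ten percent of the training portion
    validation = train_full[:n_val]
    train = train_full[n_val:]
    return train, validation, test
```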
PCT/CN2020/113450 2019-11-04 2020-09-04 Method and system for identifying false invoicing based on dynamic network representation WO2021088499A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911066791.7 2019-11-04
CN201911066791.7A CN110852856B (zh) 2019-11-04 2019-11-04 Method for identifying false invoicing based on dynamic network representation

Publications (1)

Publication Number Publication Date
WO2021088499A1 true WO2021088499A1 (zh) 2021-05-14

Family

ID=69598895

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113450 WO2021088499A1 (zh) 2019-11-04 2020-09-04 Method and system for identifying false invoicing based on dynamic network representation

Country Status (2)

Country Link
CN (1) CN110852856B (zh)
WO (1) WO2021088499A1 (zh)

Also Published As

Publication number Publication date
CN110852856B (zh) 2022-10-25
CN110852856A (zh) 2020-02-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20884592

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20884592

Country of ref document: EP

Kind code of ref document: A1