CN102541913B - Web-oriented VSM classifier training, OSSP page identification, and OSS resource extraction methods - Google Patents


Info

Publication number
CN102541913B
Authority
CN
China
Prior art keywords
keyword
vsm
ossp
oss
pages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010609743.0A
Other languages
Chinese (zh)
Other versions
CN102541913A (en)
Inventor
王怀民
朱沿旭
尹刚
袁霖
史殿习
米海波
滕猛
刘惠
刘波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a Web-oriented VSM classifier training method, an OSSP page identification method, and an OSS resource extraction method. The VSM classifier training method trains a VSM classifier on an initial training set using an OSSP page recognition feature vector. The components of that feature vector are either 7 or all 8 of the following items: software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword, and software control management command. The OSSP page identification method uses the trained VSM classifier to identify whether a Web page is an OSSP page. The OSS resource extraction method searches the identified OSSP pages for OSS resources and downloads them to local storage. The invention can significantly improve the accuracy of Web-oriented OSSP page identification, improve the completeness of OSS resource search and download, and obtain OSS resources more accurately.

Description

Web-oriented VSM classifier training, OSSP page identification, and OSS resource extraction methods
Technical field
The present invention relates to the technical field of Web document classification and Web page information extraction. Specifically, it relates to a VSM classifier training method, an OSSP (open source software homepage) page identification method, and an OSS (open source software) resource extraction method.
Background art
1. Introduction to OSS
Open source software is abbreviated as OSS. Since the free software movement began in 1983, supporters of open source software have advocated the spirit of "freedom, participation, contribution, and cooperation". This spirit has attracted a large number of talented people to devote themselves to it, and organizations such as the Free Software Foundation (FSF) and the Open Source Initiative (OSI) have gradually formed.
Open source software has developed for nearly 30 years and has reached a huge scale worldwide. Open source software led by Linux has grown rapidly, and open source software co-development alliances (OSSF) similar to SourceForge have gradually formed. The number of software projects in each alliance ranges from tens or hundreds to hundreds of thousands, and the number of such alliances keeps growing.
2. Search methods for OSS resources
At present, the major open source software alliances generally embed a specialized search engine so that users or OSS developers can find the OSS resources they need (mainly source code). However, such a specialized search engine can usually search only within a single open source software alliance; the amount of information is very limited and the returned search results are not complete enough.
In addition, the prior art can also use a general-purpose search engine (such as Google) to search for OSS resources among the massive number of Web pages. Taking Google as an example, the user enters keywords for the open source software, Google returns a list of search results, and the user obtains OSS resources by browsing that list. This approach returns more complete search results. However, the results returned by a general-purpose search engine are often mixed with a large number of Web pages that contain no OSS resources, so the user must browse many pages to find the desired software, which is very inconvenient. A solution for OSS resource search that improves both completeness and accuracy is therefore urgently needed.
3. Existing machine-learning-based text classification
The prior art includes text classification technology based on machine learning, which can be applied to Web page classification. However, using machine-learning-based text classification to identify OSSP pages has the following defects:
1. OSSP pages differ from ordinary Web pages, and keywords cannot simply be selected by word frequency. For example, words such as SVN, Git, CVS, and License, which are highly valuable for identifying OSSP pages, may have low frequency in an OSSP page and may sometimes appear only once. In the classical text classification method, words that are unrelated to OSS but have high frequency may therefore be input to the machine-learning-based classifier as the main features, which lowers the accuracy of the recognition result.
2. Among the massive number of Web pages there are many pages related to OSS, such as pages that briefly introduce a certain OSS. Such related pages have many features of OSSP pages but lack an entry to the code version control repository, so users cannot obtain source code from them. In the classical text classification method, a large number of OSS-related pages may be misjudged as OSSP pages, which also greatly reduces the accuracy of the recognition result.
In summary, a high-accuracy, Web-oriented OSSP page identification method and OSS resource extraction method are urgently needed.
Summary of the invention
An object of the present invention is to provide a high-accuracy, Web-oriented VSM classifier training method, OSSP page identification method, and OSS resource extraction method.
To achieve this object, the invention provides a VSM classifier training method that trains a VSM classifier on an initial training set using an OSSP page recognition feature vector. The OSSP page recognition feature vector is a VSM classifier feature vector whose components are either 7 or all 8 of the following items: software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword, and software control management command.
The invention also provides a Web-oriented OSSP page identification method comprising the following steps (as shown in Fig. 1):
1) based on the OSSP page recognition feature vector, train the VSM classifier with the initial training set; the OSSP page recognition feature vector is a VSM classifier feature vector whose components are either 7 or all 8 of the following items: software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword, and software control management command;
2) for each Web page to be identified, extract its OSSP page recognition feature vector, and then use the trained VSM classifier to identify whether the Web page is an OSSP page.
The invention also provides a Web-oriented OSS resource extraction method comprising the following steps:
1) identify the OSSP pages among Web pages according to the above Web-oriented OSSP page identification method;
2) search the identified OSSP pages for OSS resources and download them to local storage.
Compared with the prior art, the present invention has the following technical effects:
1. The present invention can significantly improve the accuracy of Web-oriented OSSP page identification.
2. The present invention can improve the completeness of OSS resource search and download.
3. The present invention can obtain OSS resources more accurately.
Brief description of the drawings
Fig. 1 shows the flow chart of the Web-oriented OSSP page identification method of Embodiment 1 of the present invention;
Fig. 2 shows the flow chart of the Web-oriented OSS resource extraction method of Embodiment 2 of the present invention.
Detailed description of the embodiments
To better explain the present invention, the definitions of OSS and OSSP pages and the existing machine-learning-based text classification technology are introduced first.
1. Definition of OSS and OSSP pages
The Open Source Initiative (OSI) defines OSS as follows (covering 10 aspects):
1. Free redistribution
The license must not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license must not require a royalty or other fee for such sale.
2. Source code
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, users must be clearly told how to obtain the source code, for example by free download via the Internet. The source code must be provided in the form a programmer would prefer for modifying the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed as source code.
3. Derived works
The license must allow modifications and derived works, and must allow them to be distributed under the same license as the original software.
4. Integrity of the author's source code
The license may restrict source code from being distributed in modified form only if it allows "patch files" to be distributed together with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of programs built from modified source code. The license may require derived programs to carry a different name or version number from the original software.
5. No discrimination against persons or groups
The license must not discriminate against any person or group.
6. No discrimination against fields of endeavor
The license must not restrict anyone from using the program in a specific field. For example, it must not restrict the program from being used in business or in genetic research.
7. Distribution of license
The rights attached to the program must apply to all parties to whom the program is redistributed, without those parties needing to execute an additional license.
8. License must not be specific to a product
If the program is part of a particular software distribution, the rights attached to the program must not depend on that distribution. If the program is extracted from that distribution and used or distributed under the program's license, all parties to whom the program is redistributed should have the same rights as those granted with the original software distribution.
9. License must not restrict other software
The license must not place restrictions on other software distributed along with the licensed software. For example, the license must not require that all other software distributed with it be open source software.
10. License must be technology-neutral
The license must not be predicated on any individual technology or style of interface.
The above definition is quite complex. In practice, a person skilled in the art can judge whether a software resource is an OSS resource simply from two aspects: its license and its source code. In the present invention, a software resource is deemed an OSS resource as long as it has both a license and source code. Each OSS resource generally has a corresponding open source software community homepage; for convenience of description, this homepage is referred to herein as an OSSP. It follows that if a learning machine can identify which of the massive number of Web pages are OSSP pages, OSS resource search can be significantly improved in both efficiency and completeness. In the present invention, an OSSP page is a page that provides an entry to the code version control repository, from which OSS source code can be downloaded and to which it can be uploaded. Based on this definition, a person of ordinary skill in the art can judge intuitively and unambiguously whether a Web page is an OSSP page when constructing the initial training set.
2. Existing machine-learning-based text classification
The prior art includes text classification technology based on machine learning, which can be applied to Web page classification. A classical text classification method is briefly described below. Text classification mainly consists of the following steps:
text representation → construction of the training sample set → classifier training → classification prediction.
The principal method of text representation is the vector space model (Vector Space Model, VSM). In the prior art, words (or phrases) are used as terms and weights are computed from term frequency, so each text $d$ can be represented as a vector of (term, term frequency) pairs: $d = \{(t_1, w_{1d}), (t_2, w_{2d}), \ldots, (t_n, w_{nd})\}$.
The training sample set is a finite set of text vectors together with the category of each text. Its form is shown in Table 1.
Table 1

| | Term1 | Term2 | ... | Category |
|---|---|---|---|---|
| Text 1 | Frequency of Term1 in document 1 | Frequency of Term2 in document 1 | ... | Sports |
| Text 2 | Frequency of Term1 in document 2 | Frequency of Term2 in document 2 | ... | Music |
| ... | ... | ... | ... | ... |
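As an illustration of the term-frequency representation, the following is a minimal Python sketch. The vocabulary, the two sample documents, and the lowercase whitespace tokenization are all assumptions made for the example; none of them comes from the patent.

```python
from collections import Counter

def term_frequency_vector(text, terms):
    """Return the frequency of each term in `text`, in the order of `terms`."""
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    counts = Counter(text.lower().split())
    return [counts[t.lower()] for t in terms]

terms = ["goal", "team", "guitar"]                 # hypothetical vocabulary
doc1 = "the team scored a goal and another goal"   # hypothetical "sports" text
doc2 = "a guitar solo opened the concert"          # hypothetical "music" text

print(term_frequency_vector(doc1, terms))  # [2, 1, 0]
print(term_frequency_vector(doc2, terms))  # [0, 0, 1]
```

Each returned list corresponds to one row of a table shaped like Table 1, with the category label stored separately.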
At present, the main machine-learning-based classifiers include SVM, Bayes, linear classification, decision trees, and k-NN. SVM has a solid theoretical foundation and is more accurate than most other algorithms in many application fields, especially when handling high-dimensional data. In addition, many researchers consider SVM to be probably the most accurate algorithm for text classification, so SVM is generally chosen as the main classifier.
The main working principle of an SVM classifier (in the simplest case) is as follows. SVM is a linear learning system used mainly for binary classification. The training sample set is $\{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is an $r$-dimensional input vector and $y_i$ is the category label of $X_i$. For example, for Table 1, the input vector of text 1 is $X_1 = (w_{11}, w_{21}, \ldots, w_{r1})$ and the category label is $y_i \in \{\text{sports}, \text{music}\}$.
SVM finds a linear function (1):
$$f(X) = \langle W \cdot X \rangle + b \qquad (1)$$
If $f(X_i) > 0$, then $X_i$ is assigned to the positive class; otherwise it is assigned to the negative class, i.e. (2):
$$y_i = \begin{cases} \text{positive class}, & f(X_i) > 0 \\ \text{negative class}, & \text{otherwise} \end{cases} \qquad (2)$$
For Table 1, $f(X_1) > 0$ means that text 1 is classified as $y_1 = \text{sports}$, and $f(X_2) < 0$ means that text 2 is classified as $y_2 = \text{music}$.
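The decision rule can be sketched in a few lines of Python. The weight vector, bias, and input vectors below are hypothetical values chosen only to show the sign test; a real SVM would learn W and b from the training set.

```python
def decision_value(w, x, b):
    """Compute f(X) = <W . X> + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b, positive="sports", negative="music"):
    """Assign the positive class when f(X) > 0, otherwise the negative class."""
    return positive if decision_value(w, x, b) > 0 else negative

w, b = [1.0, -1.0], 0.0   # hypothetical, not learned from data
x1 = [3, 0]               # hypothetical term frequencies of text 1
x2 = [0, 2]               # hypothetical term frequencies of text 2

print(classify(w, x1, b))  # sports
print(classify(w, x2, b))  # music
```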
From the above analysis, it can be seen that using the classical text classification method to identify OSSP pages has the following defects:
1. OSSP pages differ from ordinary Web pages, and keywords cannot simply be selected by word frequency. For example, words such as SVN, Git, CVS, and License, which are highly valuable for identifying OSSP pages, may have low frequency in an OSSP page and may sometimes appear only once. In the classical text classification method, words that are unrelated to OSS but have high frequency may therefore be input to the machine-learning-based classifier as the main features, which lowers the accuracy of the recognition result.
2. Among the massive number of Web pages there are many pages related to OSS, such as pages that briefly introduce a certain OSS. Such related pages have many features of OSSP pages but lack an entry to the code version control repository, so users cannot obtain source code from them. In the classical text classification method, a large number of OSS-related pages may be misjudged as OSSP pages, which also greatly reduces the accuracy of the recognition result.
The present invention is further described below with reference to specific embodiments.
Embodiment 1
According to one embodiment of the present invention, a Web-oriented OSSP page identification method based on a VSM (vector space model) classifier is provided. The method comprises the following steps:
1) select a group of keywords as the feature vector of the VSM classifier;
2) based on the feature vector of step 1), train the VSM classifier with the initial training set;
3) identify Web pages with the trained VSM classifier.
This embodiment also provides a corresponding OSS resource extraction method, which identifies OSSP pages according to steps 1), 2), and 3) above, then searches the OSSP pages for OSS resources (such as source code) and downloads them to a local storage device.
Each of the above steps is described below.
1. Keyword selection
In step 1), the feature vector of the VSM classifier is composed of a group of keywords of different types. In this embodiment, the keywords are divided by type into: software version control management keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, and task list keywords.
The software version control management keywords are SVN, Git, and CVS. If a Web page contains any one of the words SVN, Git, or CVS, the page is judged to have a software version control management keyword; otherwise it is judged not to have one.
The mailing list keywords are Mailing Lists, Mail_List, and Email_List. If a Web page contains any one of Mailing Lists, Mail_List, or Email_List, the page is judged to have a mailing list keyword; otherwise it is judged not to have one.
The bug tracking keywords are Bug Trackers, Issue Tracker, and Bug Report. If a Web page contains any one of Bug Trackers, Issue Tracker, or Bug Report, the page is judged to have a bug tracking keyword; otherwise it is judged not to have one.
The developer list keywords are Developer List, Member List, Project Memberlist, Blogger List, View Members, and Author. If a Web page contains any one of Developer List, Member List, Project Memberlist, Blogger List, View Members, or Author, the page is judged to have a developer list keyword; otherwise it is judged not to have one.
The license keywords are GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, and Eclipse Public License. If a Web page contains any one of GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, or Eclipse Public License, the page is judged to have a license keyword; otherwise it is judged not to have one.
The change log keywords are Change Log, Commit Log, and Update Log. If a Web page contains any one of Change Log, Commit Log, or Update Log, the page is judged to have a change log keyword; otherwise it is judged not to have one.
The task list keyword is Task Lists.
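The keyword rules above amount to a presence test per keyword type. The following Python sketch shows one possible way to turn page text into the 7-component binary vector. Case-insensitive substring matching, the dictionary key names, and the sample page text are assumptions of this sketch; the patent does not specify how matching is performed.

```python
# Keyword groups taken from the description; the dictionary keys are illustrative names.
KEYWORDS = {
    "version_control": ["SVN", "Git", "CVS"],
    "mailing_list": ["Mailing Lists", "Mail_List", "Email_List"],
    "bug_tracking": ["Bug Trackers", "Issue Tracker", "Bug Report"],
    "developer_list": ["Developer List", "Member List", "Project Memberlist",
                       "Blogger List", "View Members", "Author"],
    "license": ["GPL", "Apache License", "BSD License", "MIT License",
                "Mozilla Public License",
                "Common Development and Distribution License",
                "Eclipse Public License"],
    "change_log": ["Change Log", "Commit Log", "Update Log"],
    "task_list": ["Task Lists"],
}

def keyword_features(page_text):
    """Return a list of 0/1 values, one per keyword type, in KEYWORDS order."""
    # Assumption: case-insensitive substring search. This can over-match
    # (for example "git" inside "digit"); word-boundary matching may be preferable.
    text = page_text.lower()
    return [int(any(k.lower() in text for k in group))
            for group in KEYWORDS.values()]

page = ("Get the code: svn checkout http://example.org/repo. "
        "Licensed under the GPL. See the Change Log.")  # hypothetical page text
print(keyword_features(page))  # [1, 0, 0, 0, 1, 1, 0]
```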
2. Training the VSM classifier
In step 2), the initial training sample set is built from the OSSP pages of known open source software co-development alliances (OSSF). In the initial training sample set, the VSM feature vector corresponding to an OSSP page is: (software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword). The value of each component is "0" or "1", indicating respectively that the OSSP page does not have or has a keyword of the corresponding type. The output value of the VSM classifier is also "0" or "1", indicating respectively that the page is not or is an OSSP page.
To increase the accuracy of the VSM classifier, manually identified Web pages can further be added to the initial training sample set. According to the definition given earlier, a page that provides an entry to the code version control repository, from which OSS source code can be downloaded and to which it can be uploaded, is regarded as an OSSP page. Based on this definition, a person of ordinary skill in the art can judge intuitively and unambiguously whether a Web page is an OSSP page.
Specifically, a person of ordinary skill in the art judges, according to the OSSP page definition, whether each of 100 typical OSS-related Web pages is an OSSP page. This finally yields a training sample set whose attributes are whether the page contains a software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, and task list keyword, and whose category is whether the page is an OSSP page.
The VSM feature vector of each OSSP page or non-OSSP Web page in the initial training sample set is input to the VSM classifier, together with the VSM output value corresponding to that page, thereby obtaining the initially trained VSM classifier.
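To make the training step concrete, the following Python sketch fits a simple perceptron to binary feature vectors with 0/1 labels. The perceptron is a stand-in chosen to keep the example dependency-free; it is not the classifier used in the patent, which names SVM and decision trees. The six training samples are hypothetical and were made up so that they are linearly separable.

```python
def train_perceptron(samples, labels, max_epochs=1000):
    """Fit weights w and bias b so that (w . x + b > 0) predicts label 1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            delta = y - pred              # +1, 0, or -1
            if delta != 0:
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
                errors += 1
        if errors == 0:                   # every training sample is classified correctly
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical samples. Component order: version control, mailing list,
# bug tracking, developer list, license, change log, task list.
samples = [
    [1, 1, 1, 1, 1, 1, 0], [1, 0, 1, 0, 1, 0, 0], [1, 1, 0, 1, 1, 1, 1],  # OSSP
    [0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 0, 0],  # not OSSP
]
labels = [1, 1, 1, 0, 0, 0]

w, b = train_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])  # matches `labels` on this toy set
```

With an SVM or decision tree library the structure would be the same: a matrix of 0/1 feature vectors, a vector of 0/1 labels, and a fit step followed by prediction.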
3. Identification of Web pages
In step 3), for each Web page to be identified, the computer checks whether the page has a software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, and task list keyword, thereby obtaining the VSM feature vector corresponding to the page. This VSM feature vector is input to the trained VSM classifier to obtain the VSM output value. If the VSM output value is "1", the Web page is an OSSP page; if the VSM output value is "0", the Web page is not an OSSP page.
When the VSM output value is "1", the VSM feature vector of the current Web page and its VSM output value can further be added to the training sample set, so that the VSM classifier is trained further and the accuracy of identification improves further.
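The identification step and the optional growth of the training set can be sketched as follows. The classifier is passed in as a callable so that the sketch does not depend on any particular model; the stub rule and the sample vectors used here are hypothetical.

```python
def identify_and_extend(features, classify, training_set):
    """Classify one page; if it is judged an OSSP page, keep it for retraining.

    features:      list of 0/1 values (the VSM feature vector of the page)
    classify:      callable returning 1 (OSSP page) or 0 (not an OSSP page)
    training_set:  list of (features, label) pairs, extended in place
    """
    label = classify(features)
    if label == 1:
        training_set.append((list(features), 1))
    return label == 1

# Hypothetical stand-in for a trained classifier: requires the version
# control keyword (component 0) and the license keyword (component 4).
def stub_classifier(features):
    return 1 if features[0] == 1 and features[4] == 1 else 0

training_set = []
print(identify_and_extend([1, 0, 1, 0, 1, 1, 0], stub_classifier, training_set))  # True
print(identify_and_extend([0, 1, 0, 0, 1, 0, 0], stub_classifier, training_set))  # False
print(len(training_set))  # 1
```

Adding only pages the classifier itself labeled as positive can reinforce its own mistakes, so in practice some manual review of the added samples may be worthwhile.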
The above embodiment trains the classifier using keywords in the text of OSSP pages as features. However, using only keywords in the text may cause false positives. For example, pages that briefly introduce a certain OSS have many features of OSSP pages but lack an entry to the code version control repository, so users cannot obtain source code from them; such Web pages are therefore not OSSP pages. When only keywords in the text are used as the feature vector, a large number of such introductory pages may be misjudged as OSSP pages. The present invention therefore also provides a preferred embodiment, which is basically the same as the embodiment above except that it uses a different VSM feature vector. In the preferred embodiment, the VSM feature vector includes OSSP page structure features in addition to keywords. The OSSP page structure features include the software control management command. In the preferred embodiment, one element, the software control management command, is added to the VSM feature vector. The value of this element is "1" or "0" according to whether the Web page does or does not contain a software control management command. The rest of the preferred embodiment is identical to the preceding embodiment and is not repeated here.
In the preferred embodiment, the software control management commands include: the command for the first download, the command for downloading the latest update from the server, the command for checking out a particular revision, the command for adding a tracked file, the command for deleting a tracked file, and the command for committing changes. If a Web page contains any one of these commands, the page is judged to have a software control management command; otherwise it is judged not to have one.
Further, the software control management commands of OSSP pages include the following common commands of the three current mainstream version control tools (svn, cvs, git):
(1) First download, including the source code and the version repository
For SVN, the command is:
svn checkout http://path/to/repo repo_name
For CVS, the command is:
cvs checkout project_name
For Git, the command is:
git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
(2) Download the latest update from the server
For SVN, the command is:
svn update [-r rev] PATH
For CVS, the command is:
cvs update
For Git, the command is:
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
(3) Check out a particular revision
For SVN, the command is:
svn checkout -r <rev>
For CVS, the command is:
cvs checkout -r rel-1-0 tc
For Git, the command is:
git reset --hard <rev>
(4) Add a tracked file
For SVN, the command is:
svn add PATH...
For CVS, the command is:
cvs add new_file
For Git, the command is:
git-add Documentation/Sandwiches
(5) Delete a tracked file
For SVN, the command is:
svn delete PATH
For CVS, the command is:
cvs rm file_name
For Git, the command is:
git rm /path/to/file
(6) Commit changes
For SVN, the command is:
svn status -v PATH
For CVS, the command is:
cvs commit -m "write some comments here" file_name
For Git, the command is:
git commit
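One way to compute the eighth feature is to search the page text for the command prefixes listed above. The regular expression below is an assumption of this sketch, covering only the subcommands named in the list, and it accepts both the `git clone` and the older `git-clone` spelling; the patent does not prescribe a matching technique.

```python
import re

# Subcommands taken from the list above; matching by regular expression
# is an assumption of this sketch.
COMMAND_PATTERN = re.compile(
    r"\b(?:svn\s+(?:checkout|update|add|delete|status)"
    r"|cvs\s+(?:checkout|update|add|rm|commit)"
    r"|git[\s-]+(?:clone|pull|reset|add|rm|commit))\b",
    re.IGNORECASE,
)

def has_control_command(page_text):
    """Return 1 if the text contains a version control command, else 0."""
    return int(COMMAND_PATTERN.search(page_text) is not None)

print(has_control_command("svn checkout http://path/to/repo repo_name"))  # 1
print(has_control_command("git-clone git://example.org/project.git"))     # 1
print(has_control_command("This project uses Git and SVN."))              # 0
```

The last call shows why the command feature helps against false positives: a page that merely mentions Git or SVN has the keyword but not the command.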
In addition to the above embodiments, the VSM feature vector of the present invention can also use other combinations. For example, any 7 of the following items can be selected to form the VSM feature vector: software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword, and software control management command.
The VSM classifier may be chosen from classifiers such as SVM, Bayes, linear classification, decision tree, or k-NN. Of these, the methods suited to binary classification are SVM and decision tree. The above embodiments are binary classification problems, so the optional classifiers are SVM and decision tree.
Some actual test data of the present invention are given below.
Description of the test sample set: the test sample set of the present invention is constructed in the same way as the training sample set of the classifier. In both cases, a person of ordinary skill in the art judges, according to the OSSP page definition, whether each of 100 typical OSS-related Web pages is an OSSP page, finally forming a sample set whose attributes are whether the page contains a software version control management keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, and task list keyword, and whose category is whether the page is an OSSP page.
Description of the experimental conditions:
Hardware configuration: SONY NW series (CPU: dual-core 2.1 GHz, memory: 4 GB)
Software configuration: the operating system is Windows 7, the build and run environment is Eclipse Java EE IDE for Web Developers, and the database is MySQL 5.0.89.
The definition of accuracy is shown in Table 2:
Table 2
Note: TP: the number of samples that are actually positive and are correctly classified as positive (true positive)
FN: the number of samples that are actually positive and are wrongly classified as negative (false negative)
FP: the number of samples that are actually negative and are wrongly classified as positive (false positive)
TN: the number of samples that are actually negative and are correctly classified as negative (true negative)
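For reference, the four counts are commonly combined into accuracy, precision, and recall as in the Python sketch below. The content of Table 2 is not reproduced in this text, so these are the standard textbook formulas and may not match the patent's definition exactly. The counts in the usage example are hypothetical and are not results from the patent.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts for a 100-page test set (not data from the patent).
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics["accuracy"])  # 0.85
print(metrics["recall"])    # 0.8
```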
The classification results produced by different classifiers using the method of the present invention are shown in Table 3:
Table 3
As can be seen from Table 3, with either an SVM classifier or a decision tree classifier, the present invention significantly improves on the conventional machine-learning-based text classification method.
Embodiment 2
This embodiment provides an automatic, intelligent OSS resource acquisition method. Starting from the Web pages of open source software alliances, the method traverses the links on each page, learns the features of pages and links, automatically and efficiently identifies OSSP pages, and finally extracts the OSS resources and stores them in a local database. In this embodiment, the OSS resources may be OSS information, which includes the software name, the development community entry address, the development team entry address, the mailing list entry address, the bug list entry address, and the code version control system entry address.
As shown in Fig. 2, the OSS information extraction method of this embodiment comprises the following steps:
Step 1: Enter the URLs of the major open source software alliances and store them in the link buffer queue (the seed link list).
Step 2: Automatically read an unread link from the queue, analyze the Web page the link points to, traverse the page according to the different link types it contains, judge whether each Web page encountered on the traversal path is an OSSP page, and capture the identified OSSP pages. This finally forms an OSSP page set, and the link buffer queue is updated along the way. The OSSP page identification method is the same as in Embodiment 1 and is not repeated here. When a new OSSP page is identified, it may be added to the OSSP learning sample set so that the classifier can be trained continually.
Step 3: Automatically analyze each OSSP page in the OSSP page set, identify the OSS-related attributes, and extract the OSS information corresponding to each OSSP page. The OSS information includes the software name, the development community entry address, the development team entry address, the mailing list entry address, the bug list entry address, and the code version control system entry address.
Step 4: Store the extracted OSS information in a database table whose fields are <software name, development community entry address, development team entry address, mailing list entry address, bug list entry address, code version control system entry address>.
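Steps 1 to 4 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the toy in-memory Web (`FAKE_WEB`), the example URLs, the anchor-text mapping, the column names, and the keyword-count stand-in for the trained classifier are all hypothetical, and SQLite is used in place of the MySQL database mentioned in the experimental conditions so that the example is self-contained.

```python
import sqlite3
from collections import deque

# Hypothetical stand-in for the Web: URL -> title, text, and (anchor, target) links.
FAKE_WEB = {
    "http://forge.example/": {
        "title": "Forge", "text": "Project directory",
        "links": [("Foo", "http://forge.example/foo"),
                  ("About", "http://forge.example/about")],
    },
    "http://forge.example/about": {"title": "About", "text": "About this site", "links": []},
    "http://forge.example/foo": {
        "title": "Foo",
        "text": "SVN Mailing Lists Bug Report Developer License Change Log",
        "links": [("Community", "http://forge.example/foo/community"),
                  ("Developer", "http://forge.example/foo/team"),
                  ("Mailing Lists", "http://forge.example/foo/lists"),
                  ("Bug Report", "http://forge.example/foo/bugs"),
                  ("SVN", "http://forge.example/foo/svn")],
    },
}

# Anchor text -> table column; both sides are illustrative assumptions.
ANCHOR_TO_FIELD = {"Community": "community_url", "Developer": "team_url",
                   "Mailing Lists": "mail_list_url", "Bug Report": "bug_list_url",
                   "SVN": "vcs_url"}
FIELDS = ["software_name", "community_url", "team_url",
          "mail_list_url", "bug_list_url", "vcs_url"]

def is_ossp_page(page):
    """Placeholder for the trained classifier of Embodiment 1 (simple keyword count)."""
    hits = sum(k in page["text"] for k in ("SVN", "Mailing Lists", "Bug Report", "License"))
    return hits >= 3

def crawl(seeds):
    queue, seen, ossp_pages = deque(seeds), set(seeds), {}
    while queue:
        url = queue.popleft()              # Step 2: read one unread link
        page = FAKE_WEB.get(url)
        if page is None:                   # target not present in the toy Web
            continue
        if is_ossp_page(page):
            ossp_pages[url] = page         # capture the identified OSSP page
        for _, target in page["links"]:    # update the link buffer queue
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return ossp_pages

def extract_oss_info(page):
    info = dict.fromkeys(FIELDS)           # Step 3: one record per OSSP page
    info["software_name"] = page["title"]
    for anchor, target in page["links"]:
        field = ANCHOR_TO_FIELD.get(anchor)
        if field:
            info[field] = target
    return info

def store(records):
    conn = sqlite3.connect(":memory:")     # Step 4: SQLite replaces MySQL in this sketch
    conn.execute(f"CREATE TABLE oss_info ({', '.join(f + ' TEXT' for f in FIELDS)})")
    conn.executemany(
        f"INSERT INTO oss_info VALUES ({', '.join('?' for _ in FIELDS)})",
        [tuple(r[f] for f in FIELDS) for r in records])
    return conn

ossp = crawl(["http://forge.example/"])    # Step 1: seed link list
conn = store([extract_oss_info(p) for p in ossp.values()])
rows = conn.execute("SELECT software_name, vcs_url FROM oss_info").fetchall()
```

The `seen` set keeps each link from being queued twice, which is one simple way to realize "unread" links in the buffer queue. A real crawler would fetch pages over HTTP and would replace `is_ossp_page` with the VSM classifier trained as in Embodiment 1.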
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical solution of the present invention may be modified or equivalently substituted without departing from its spirit and scope.

Claims (12)

1. A Web-oriented VSM classifier training method, comprising:
training a VSM classifier with an initial training set, based on an open source software homepage recognition feature vector; the open source software homepage recognition feature vector being a VSM classifier feature vector whose components are seven of, or all eight of, the following: software version control management keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, task list keywords, and software control management commands.
2. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software version control management keywords include SVN, Git, or CVS.
3. The Web-oriented VSM classifier training method according to claim 1, characterized in that the mailing list keywords include Mailing Lists, Mail_List, or Email_List.
4. The Web-oriented VSM classifier training method according to claim 1, characterized in that the bug tracking keywords include Bug Trackers, Issue Tracker, or Bug Report.
5. The Web-oriented VSM classifier training method according to claim 1, characterized in that the developer list keywords include Developer, Developer List, Member List, Project Memberlist, Blogger List, View Members, or Author.
6. The Web-oriented VSM classifier training method according to claim 1, characterized in that the license keywords include License, GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, or Eclipse Public License.
7. The Web-oriented VSM classifier training method according to claim 1, characterized in that the change log keywords include Change Log, Commit Log, or Update Log.
8. The Web-oriented VSM classifier training method according to claim 1, characterized in that the task list keywords include Task Lists.
9. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software control management commands include the command for a first-time download, the command for downloading the latest updates from the server, the command for checking out a particular revision, the command for adding a tracked file, the command for deleting a tracked file, or the command for committing changes.
10. A Web-oriented open source software homepage identification method, characterized by comprising the following steps:
1) for each Web page to be identified, extracting the open source software homepage recognition feature vector of that Web page, the open source software homepage recognition feature vector being a VSM classifier feature vector whose components are seven of, or all eight of, the following: software version control management keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, task list keywords, and software control management commands;
2) then using a VSM classifier trained by the VSM classifier training method according to any one of claims 1 to 9 to identify whether the Web page is an open source software homepage.
11. A Web-oriented open source software resource acquisition method, characterized by comprising the following steps:
1) identifying open source software homepages among Web pages by the Web-oriented open source software homepage identification method according to claim 10;
2) searching for open source software resources in the identified open source software homepages.
12. The Web-oriented open source software resource acquisition method according to claim 11, characterized by further comprising the following step:
3) downloading the open source software resources found in step 2) to local storage.
CN201010609743.0A 2010-12-15 2010-12-15 VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented Expired - Fee Related CN102541913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010609743.0A CN102541913B (en) 2010-12-15 2010-12-15 VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010609743.0A CN102541913B (en) 2010-12-15 2010-12-15 VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented

Publications (2)

Publication Number Publication Date
CN102541913A CN102541913A (en) 2012-07-04
CN102541913B true CN102541913B (en) 2017-10-03

Family

ID=46348830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010609743.0A Expired - Fee Related CN102541913B (en) 2010-12-15 2010-12-15 VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented

Country Status (1)

Country Link
CN (1) CN102541913B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103078897B * 2012-11-29 2015-11-18 Sun Yat-sen University A system for fine-grained classification and management of Web services
CN103226509B * 2013-04-08 2016-03-30 Shanghai Huali Microelectronics Co., Ltd. A method for automatic analysis of system logs
CN110188536B * 2019-05-22 2021-04-20 Beijing University of Posts and Telecommunications Application program detection method and device
CN110990035B * 2019-11-01 2023-03-14 Unit 63811 of the Chinese People's Liberation Army Chained software upgrade method based on Git

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1719436A * 2004-07-09 2006-01-11 Institute of Automation, Chinese Academy of Sciences Method and device for a new feature vector weight for text classification
CN101055621A * 2006-04-10 2007-10-17 Institute of Automation, Chinese Academy of Sciences Content-based sensitive Web page identification method
CN101281521A * 2007-04-05 2008-10-08 Institute of Automation, Chinese Academy of Sciences Method and system for filtering sensitive Web pages based on multi-classifier fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
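A survey of data mining for software development repositories; Bai Jie, Li Chunping; Application Research of Computers (《计算机应用研究》); 2008-01-31; Vol. 25, No. 1; Sections 1 and 3.1.1 *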

Also Published As

Publication number Publication date
CN102541913A (en) 2012-07-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171003

Termination date: 20201215