CN102541913B - Web-oriented VSM classifier training method, OSSP page identification method, and OSS resource extraction method - Google Patents
- Publication number
- CN102541913B (CN201010609743.0A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- vsm
- ossp
- oss
- pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention provides a Web-oriented VSM classifier training method, an OSSP page identification method, and an OSS resource extraction method. The VSM classifier training method comprises training a VSM classifier with an initial training set, based on an OSSP page identification feature vector. The OSSP page identification feature vector is a VSM classifier feature vector whose components are seven, or all eight, of the following: a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, a task list keyword, and a version control command. The OSSP page identification method uses the trained VSM classifier to determine whether a Web page is an OSSP page. The OSS resource extraction method searches the identified OSSP pages for OSS resources and downloads them to local storage. The invention can significantly improve the accuracy of Web-oriented OSSP page identification, improve the completeness of OSS resource search and download, and obtain OSS resources more accurately.
Description
Technical field
The present invention relates to the technical field of Web document classification and Web page information extraction. Specifically, it relates to a VSM classifier training method, an OSSP (open source software homepage) page identification method, and an OSS (open source software) resource extraction method.
Background art
1. Introduction to OSS
Open source software is abbreviated as OSS. Since the free software movement began in 1983, supporters of open source software have advocated a spirit of freedom, participation, contribution, and cooperation. This spirit has attracted a large number of talented people and has gradually given rise to organizations such as the Free Software Foundation (FSF) and the Open Source Initiative (OSI).
Over nearly 30 years of development, open source software has reached a huge scale worldwide. Open source software led by Linux has grown rapidly, and open source software co-development alliances (OSSF) such as SourceForge have gradually formed. The number of software projects in each alliance ranges from tens or hundreds to hundreds of thousands, and the number of such alliances keeps growing.
2. Methods for searching OSS resources
At present, the major open source software alliances usually embed a specialized search engine so that users or OSS developers can find the OSS resources (mainly source code) they need. However, such a specialized search engine usually searches only within one open source software alliance, so the amount of information is very limited and the returned search results are incomplete.
In addition, the prior art can use a general-purpose search engine (such as Google) to search for OSS resources among massive numbers of Web pages. Taking Google as an example, the user enters keywords for the open source software, Google returns a list of search results, and the user obtains OSS resources by browsing that list. This approach returns more complete results, but the results returned by a general-purpose search engine are often mixed with a large number of Web pages that contain no OSS resources. The user must therefore go through many pages to find the desired software, which is inconvenient. A solution for OSS resource search that improves both completeness and accuracy is therefore urgently needed.
3. Existing text classification techniques based on machine learning
The prior art includes text classification techniques based on machine learning, which can be applied to Web page classification. However, applying such techniques to OSSP page identification has the following defects:
1. OSSP pages differ from ordinary Web pages, so keywords cannot simply be chosen by word frequency. For example, words such as SVN, Git, CVS, and License are quite valuable for identifying OSSP pages, yet their frequency within an OSSP page may be low, and they may even appear only once. In a classical text classification method, words that are unrelated to OSS but have a high frequency may therefore be fed into the machine-learning classifier as the principal features, which lowers the accuracy of the identification result.
2. Among massive numbers of Web pages there are many pages related to OSS, such as pages that introduce a particular OSS. Such related pages have many features of OSSP pages but lack an entry point to the code version control repository; that is, the user cannot obtain source code from them. A classical text classification method is therefore likely to misjudge a large number of OSS-related pages as OSSP pages, which also greatly reduces the accuracy of the identification result.
In summary, a highly accurate Web-oriented OSSP page identification method and OSS resource extraction method are urgently needed.
Summary of the invention
The object of the present invention is to provide a highly accurate Web-oriented VSM classifier training method, OSSP page identification method, and OSS resource extraction method.
To achieve this object, the invention provides a VSM classifier training method, which trains a VSM classifier with an initial training set based on an OSSP page identification feature vector. The OSSP page identification feature vector is a VSM classifier feature vector whose components are seven, or all eight, of the following: a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, a task list keyword, and a version control command.
The invention also provides a Web-oriented OSSP page identification method, comprising the following steps (as shown in Figure 1):
1) Based on an OSSP page identification feature vector, train a VSM classifier with an initial training set. The OSSP page identification feature vector is a VSM classifier feature vector whose components are seven, or all eight, of the following: a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, a task list keyword, and a version control command.
2) For each Web page to be identified, extract its OSSP page identification feature vector, then use the trained VSM classifier to determine whether the Web page is an OSSP page.
The invention also provides a Web-oriented OSS resource extraction method, comprising the following steps:
1) Identify the OSSP pages among Web pages according to the Web-oriented OSSP page identification method described above;
2) Search the identified OSSP pages for OSS resources and download them to local storage.
Compared with the prior art, the present invention has the following technical effects:
1. The invention can significantly improve the accuracy of Web-oriented OSSP page identification;
2. The invention can improve the completeness of OSS resource search and download;
3. The invention can obtain OSS resources more accurately.
Brief description of the drawings
Fig. 1 shows a flow chart of the Web-oriented OSSP page identification method of Embodiment 1 of the present invention;
Fig. 2 shows a flow chart of the Web-oriented OSS resource extraction method of Embodiment 2 of the present invention.
Detailed description of the embodiments
To better explain the present invention, the definitions of OSS and OSSP pages are introduced first, followed by the existing text classification techniques based on machine learning.
1. Definitions of OSS and OSSP pages
The Open Source Initiative (OSI) defines OSS in terms of the following 10 criteria:
1. Free redistribution
The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources. The license shall not require a royalty or other fee for such sale.
2. Source code
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code, for example by free download over the Internet. The source code must be in the form in which a programmer would prefer to modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms, such as the output of a preprocessor or translator, are not allowed as source code.
3. Derived works
The license must allow modifications and derived works, and must allow them to be distributed under the same license as the original software.
4. Integrity of the author's source code
The license may restrict source code from being distributed in modified form only if it allows "patch files" to be distributed together with the source code for the purpose of modifying the program at build time. The license must explicitly permit distribution of programs built from modified source code. The license may require derived works to carry a name or version number different from that of the original software.
5. No discrimination against persons or groups
The license must not discriminate against any person or group.
6. No discrimination against fields of endeavor
The license must not restrict anyone from using the program in a specific field. For example, it may not restrict the program from being used in business or in genetic research.
7. Distribution of license
The rights attached to the program must apply to everyone to whom the program is redistributed, without those parties needing to execute an additional license.
8. License must not be specific to a product
If the program is part of a particular software distribution, the rights attached to the program must not depend on that distribution. If the program is extracted from that distribution and used or distributed under the program's license, all parties to whom the program is redistributed should have the same rights as those granted with the original software distribution.
9. License must not restrict other software
The license must not place restrictions on other software distributed along with the licensed software. For example, the license must not require that all other software distributed with it be open source software.
10. License must be technology-neutral
No provision of the license may be predicated on any individual technology or style of interface.
The definition above is quite complex. In practice, those skilled in the art can judge whether a software resource is an OSS resource simply from two aspects: the license and the source code. In the present invention, a software resource is regarded as an OSS resource as long as it has both a license and source code. Each OSS resource usually has a corresponding open source software community homepage; for convenience of description, the open source software community homepage is referred to herein as the OSSP. It is easy to see that if machine learning can identify which of a massive number of Web pages are OSSP pages, the search for OSS resources can be significantly improved in both efficiency and completeness. In the present invention, an OSSP page is a page that provides an entry point to the code version control repository and from which OSS source code can be downloaded and uploaded. Based on this definition, a person of ordinary skill in the art can judge intuitively and unambiguously whether a Web page is an OSSP page when constructing the initial training set.
2. Existing text classification techniques based on machine learning
The prior art includes text classification techniques based on machine learning, which can be applied to Web page classification. A classical text classification method is briefly described below. Text classification mainly comprises the following steps:
text representation → construction of the training sample set → classifier training → classification prediction.
The principal method of text representation is the vector space model (VSM). In the prior art, words (or phrases) are mainly used as terms, and weights are computed from term frequency, so each text d can be represented as a vector of (term, frequency) pairs: $d = \{(t_1, w_{1d}), (t_2, w_{2d}), \ldots, (t_n, w_{nd})\}$.
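As an illustration of this representation, the following minimal Python sketch builds the (term, frequency) vector for a short text. The function name `term_frequency_vector`, the tokenizer, and the sample text are assumptions made for this example; they do not come from the patent.

```python
import re
from collections import Counter


def term_frequency_vector(text, terms):
    """Return [(term, frequency), ...] for the given vocabulary, in order."""
    # Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    return [(t, counts[t.lower()]) for t in terms]


# Hypothetical sample text and vocabulary.
doc = "Git and SVN are version control tools; Git is distributed."
print(term_frequency_vector(doc, ["git", "svn", "license"]))
# [('git', 2), ('svn', 1), ('license', 0)]
```

Note how a term that matters for OSSP identification, such as "license", can have a frequency of zero or one, which is the weakness of frequency-based weighting that the description goes on to discuss.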
The training sample set is a finite set of text vectors paired with the class of each text, in the form shown in Table 1.
Table 1
Text | Term1 | Term2 | ...... | Class
Text 1 | Frequency of Term1 in text 1 | Frequency of Term2 in text 1 | ...... | Sports
Text 2 | Frequency of Term1 in text 2 | Frequency of Term2 in text 2 | ...... | Music
...... | ...... | ...... | ...... | ......
At present, machine-learning classifiers mainly include SVM, Bayes, linear classification, decision tree, and k-NN methods. SVM has a solid theoretical foundation and is more accurate than most other algorithms in many application fields, especially when handling high-dimensional data. In addition, many researchers consider SVM to be possibly the most accurate algorithm for text classification, so SVM is generally chosen as the main classifier.
The main working principle of an SVM classifier (in the simplest case) is as follows. An SVM is a linear learning system used mainly for binary classification problems. The training sample set is $\{(X_1, y_1), (X_2, y_2), \ldots, (X_n, y_n)\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ir})$ is an r-dimensional input vector and $y_i$ is the class label of $X_i$. For Table 1, for example, the input vector of text 1 is $X_1 = (w_{11}, w_{21}, \ldots, w_{r1})$ and its class label is $y_1 \in \{\text{sports}, \text{music}\}$.
The SVM finds a linear function (1):
$f(X) = \langle W \cdot X \rangle + b$ (1)
If $f(X_i) > 0$, $X_i$ is assigned to the positive class; otherwise it is assigned to the negative class, that is, (2):
$y_i = \begin{cases} \text{positive class}, & f(X_i) > 0 \\ \text{negative class}, & \text{otherwise} \end{cases}$ (2)
For Table 1, $f(X_1) > 0$ means that text 1 is classified as $y_1$ = sports, and $f(X_2) < 0$ means that text 2 is classified as $y_2$ = music.
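The decision rule in (1) and (2) can be sketched in a few lines of Python. The weights `w` and bias `b` below are hypothetical values chosen for illustration; a real SVM would learn them from the training sample set, and this sketch does not perform that learning.

```python
def decision_value(w, x, b):
    """Compute f(X) = <W . X> + b for two sequences of equal length."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same dimension")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b, positive="sports", negative="music"):
    """Assign the positive class when f(X) > 0, otherwise the negative class."""
    return positive if decision_value(w, x, b) > 0 else negative


# Hypothetical weights and bias, not learned from data.
w, b = [0.8, -0.5], -0.1
print(classify(w, [3, 0], b))  # sports
print(classify(w, [0, 4], b))  # music
```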
From the above analysis, it can be seen that applying this classical text classification method to OSSP page identification has the following defects:
1. OSSP pages differ from ordinary Web pages, so keywords cannot simply be chosen by word frequency. For example, words such as SVN, Git, CVS, and License are quite valuable for identifying OSSP pages, yet their frequency within an OSSP page may be low, and they may even appear only once. In a classical text classification method, words that are unrelated to OSS but have a high frequency may therefore be fed into the machine-learning classifier as the principal features, which lowers the accuracy of the identification result.
2. Among massive numbers of Web pages there are many pages related to OSS, such as pages that introduce a particular OSS. Such related pages have many features of OSSP pages but lack an entry point to the code version control repository; that is, the user cannot obtain source code from them. A classical text classification method is therefore likely to misjudge a large number of OSS-related pages as OSSP pages, which also greatly reduces the accuracy of the identification result.
The present invention is further described below with reference to specific embodiments.
Embodiment 1
According to one embodiment of the present invention, a Web-oriented OSSP page identification method based on a VSM (vector space model) classifier is provided. The method comprises the following steps:
1) Choose a group of keywords as the feature vector of the VSM classifier;
2) Based on the feature vector of step 1), train the VSM classifier with an initial training set;
3) Identify Web pages with the trained VSM classifier.
This embodiment also provides a corresponding OSS resource extraction method, which identifies OSSP pages according to steps 1), 2), and 3) above, then searches the OSSP pages for OSS resources (such as source code) and downloads them to a local storage device.
Each of these steps is described below.
1. Keyword selection
In step 1), the feature vector of the VSM classifier consists of a group of keywords of different types. In this embodiment, the keywords are divided by type into: software version control keywords, mailing list keywords, bug tracking keywords, developer list keywords, license keywords, change log keywords, and task list keywords.
The software version control keywords include SVN, Git, and CVS. If a Web page contains any one of the words SVN, Git, or CVS, the page is judged to have a software version control keyword; otherwise it is judged not to have one.
The mailing list keywords include Mailing Lists, Mail_List, and Email_List. If a Web page contains any one of Mailing Lists, Mail_List, or Email_List, the page is judged to have a mailing list keyword; otherwise it is judged not to have one.
The bug tracking keywords include Bug Trackers, Issue Tracker, and Bug Report. If a Web page contains any one of Bug Trackers, Issue Tracker, or Bug Report, the page is judged to have a bug tracking keyword; otherwise it is judged not to have one.
The developer list keywords include Developer List, Member List, Project Memberlist, Blogger List, View Members, and Author. If a Web page contains any one of Developer List, Member List, Project Memberlist, Blogger List, View Members, or Author, the page is judged to have a developer list keyword; otherwise it is judged not to have one.
The license keywords include GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, and Eclipse Public License. If a Web page contains any one of GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License, or Eclipse Public License, the page is judged to have a license keyword; otherwise it is judged not to have one.
The change log keywords include Change Log, Commit Log, and Update Log. If a Web page contains any one of Change Log, Commit Log, or Update Log, the page is judged to have a change log keyword; otherwise it is judged not to have one.
The task list keywords include task lists.
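The keyword rules above amount to a presence test per keyword type. The following Python sketch shows one way such a test could be written; the keyword lists are taken from the text, while the group names, the use of whole-word matching, and the case-insensitive comparison are assumptions of this sketch rather than details stated in the patent.

```python
import re

# Keyword lists as given in the description; the dictionary keys are illustrative names.
KEYWORD_GROUPS = {
    "version_control": ["SVN", "Git", "CVS"],
    "mailing_list": ["Mailing Lists", "Mail_List", "Email_List"],
    "bug_tracking": ["Bug Trackers", "Issue Tracker", "Bug Report"],
    "developer_list": ["Developer List", "Member List", "Project Memberlist",
                       "Blogger List", "View Members", "Author"],
    "license": ["GPL", "Apache License", "BSD License", "MIT License",
                "Mozilla Public License",
                "Common Development and Distribution License",
                "Eclipse Public License"],
    "change_log": ["Change Log", "Commit Log", "Update Log"],
    "task_list": ["task lists"],
}


def _group_pattern(words):
    # Assumption: match whole words, ignoring case, so "digital" does not count as "Git".
    alternatives = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


PATTERNS = {name: _group_pattern(words) for name, words in KEYWORD_GROUPS.items()}


def extract_features(page_text):
    """Return a list of 0/1 values, one per keyword type, in KEYWORD_GROUPS order."""
    return [1 if PATTERNS[name].search(page_text) else 0 for name in KEYWORD_GROUPS]


# Hypothetical page text.
page = ("Source is in Git. See the Issue Tracker and Change Log. "
        "Licensed under the Apache License.")
print(extract_features(page))  # [1, 0, 1, 0, 1, 1, 0]
```

Whole-word matching is a design choice of this sketch: short keywords such as GPL or Git would otherwise match inside unrelated words.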
2. Training the VSM classifier
In step 2), the initial training sample set is built from the OSSP pages of known open source software co-development alliances (OSSF). In the initial training sample set, the VSM feature vector corresponding to an OSSP page is: (software version control keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword). The value of each component is "0" or "1", indicating respectively that the page does not have, or has, a keyword of the corresponding type. The output value of the VSM classifier is also "0" or "1", indicating respectively that the page is not, or is, an OSSP page.
To increase the accuracy of the VSM classifier, manually identified Web pages can be added to the initial training sample set. According to the definition given earlier, a page that provides an entry point to the code version control repository and from which OSS source code can be downloaded and uploaded is regarded as an OSSP page. Based on this definition, a person of ordinary skill in the art can judge intuitively and unambiguously whether a Web page is an OSSP page.
Specifically, a person of ordinary skill in the art judges, according to the OSSP page definition, whether each of 100 typical OSS-related Web pages is an OSSP page. The result is a training sample set whose attributes are whether the page contains a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, and a task list keyword, and whose class is whether the page is an OSSP page.
The VSM feature vector of each OSSP page or non-OSSP Web page in the initial training sample set is input into the VSM classifier, together with the VSM output value corresponding to that page, thereby obtaining an initially trained VSM classifier.
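To make the training step concrete, the sketch below trains a simple perceptron on binary feature vectors of the kind described above. The perceptron is a stand-in chosen because it fits in a few lines of dependency-free Python; it is not the SVM or decision tree that the patent goes on to recommend. The six training samples and their labels are hypothetical.

```python
def predict(weights, bias, x):
    """Return 1 (OSSP page) if the linear score is positive, otherwise 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, labels, max_epochs=1000):
    """Classic perceptron rule: adjust the weights on every misclassified sample."""
    weights = [0] * len(samples[0])
    bias = 0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = y - predict(weights, bias, x)
            if delta != 0:
                weights = [w + delta * xi for w, xi in zip(weights, x)]
                bias += delta
                errors += 1
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


# Hypothetical training set. Component order: version control, mailing list,
# bug tracking, developer list, license, change log, task list.
samples = [
    [1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 1],
]
labels = [1, 1, 0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])  # [1, 1, 0, 0, 0, 1]
```

The perceptron only converges when the training data is linearly separable, which holds for this toy set; real page data may need one of the classifiers named in the description.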
3. Identifying Web pages
In step 3), for each Web page to be identified, the computer checks whether the page has a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, and a task list keyword, and thereby derives the VSM feature vector corresponding to the page. This VSM feature vector is input into the trained VSM classifier to obtain the VSM output value. If the VSM output value is "1", the Web page is an OSSP page; if the VSM output value is "0", the Web page is not an OSSP page.
When the VSM output value is "1", the VSM feature vector of the current Web page and its VSM output value can additionally be added to the training sample set, which is then used to train the VSM classifier further and so improve identification accuracy.
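The identification step and the optional feedback into the training sample set can be sketched as follows. The function `identify_and_collect` and the stand-in `toy_classifier` are hypothetical names introduced for this example; in practice the classifier would be the trained VSM classifier.

```python
def identify_and_collect(feature_vectors, classifier, training_set):
    """Classify each feature vector and append positives to training_set."""
    results = []
    for x in feature_vectors:
        label = classifier(x)
        results.append(label)
        if label == 1:
            # Pages identified as OSSP pages are kept for further training.
            training_set.append((x, 1))
    return results


def toy_classifier(x):
    # Hypothetical stand-in rule: version control keyword and license keyword both present.
    return 1 if x[0] == 1 and x[4] == 1 else 0


training_set = []
pages = [[1, 0, 1, 0, 1, 1, 0], [0, 0, 0, 1, 1, 0, 0]]
print(identify_and_collect(pages, toy_classifier, training_set))  # [1, 0]
print(len(training_set))  # 1
```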
The embodiment above trains the classifier using keywords in the text of OSSP pages as features. However, using only keywords in the text may produce false positives. For example, a page that introduces a particular OSS has many features of an OSSP page but lacks an entry point to the code version control repository; that is, the user cannot obtain source code from such a page, so it is not an OSSP page. When only keywords in the text are used as the feature vector, a large number of such introductory pages may be misjudged as OSSP pages. The present invention therefore also provides a preferred embodiment, which is essentially the same as the embodiment above except that it uses a different VSM feature vector. In the preferred embodiment, the VSM feature vector includes OSSP page structure features in addition to keywords. The OSSP page structure features include version control commands. In the preferred embodiment, one element is added to the VSM feature vector: the version control command. The value of this element is "1" or "0" according to whether or not the Web page contains a version control command. The rest of the preferred embodiment is identical to the embodiment described above and is not repeated here.
In the preferred embodiment, the version control commands include: the command for the initial download, the command for downloading the latest update from the server, the command for checking out a particular revision, the command for adding a tracked file, the command for deleting a tracked file, and the command for committing changes. If a Web page contains any one of these commands, the page is judged to have a version control command; otherwise it is judged not to have one.
Further, the version control commands on OSSP pages include the common commands of the three mainstream version control tools (SVN, CVS, and Git):
(1) Initial download, including the source code and the version repository
For SVN, the command is:
svn checkout http://path/to/repo repo_name
For CVS, the command is:
cvs checkout project_name
For Git, the command is:
git-clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
(2) Download the latest update from the server
For SVN, the command is:
svn update [-r rev] PATH
For CVS, the command is:
cvs update
For Git, the command is:
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
(3) Check out a particular revision
For SVN, the command is:
svn checkout -r <rev>
For CVS, the command is:
cvs checkout -r rel-1-0 tc
For Git, the command is:
git reset --hard <rev>
(4) Add a tracked file
For SVN, the command is:
svn add PATH...
For CVS, the command is:
cvs add new_file
For Git, the command is:
git-add Documentation/Sandwiches
(5) Delete a tracked file
For SVN, the command is:
svn delete PATH
For CVS, the command is:
cvs rm file_name
For Git, the command is:
git rm /path/to/file
(6) Commit changes
For SVN, the command is:
svn status -v PATH
For CVS, the command is:
cvs commit -m "write some comments here" file_name
For Git, the command is:
git commit
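The eighth feature, the presence of a version control command, could be detected with a pattern such as the one sketched below. The regular expression covers the SVN, CVS, and Git subcommands listed above; the exact pattern, the acceptance of both `git clone` and the older `git-clone` spelling, and the function name are assumptions of this sketch.

```python
import re

# Subcommands taken from the list above; the pattern itself is an illustrative assumption.
COMMAND_PATTERN = re.compile(
    r"\b(?:svn\s+(?:checkout|update|add|delete|status)"
    r"|cvs\s+(?:checkout|update|add|rm|commit)"
    r"|git(?:-|\s+)(?:clone|pull|reset|add|rm|commit))\b",
    re.IGNORECASE,
)


def has_version_control_command(page_text):
    """Return 1 if the text contains one of the listed commands, otherwise 0."""
    return 1 if COMMAND_PATTERN.search(page_text) else 0


print(has_version_control_command("svn checkout http://path/to/repo repo_name"))  # 1
print(has_version_control_command("We use Git for version control."))  # 0
```

The second call shows why this feature helps against false positives: a page can mention Git without offering any command that reaches the repository.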
In addition to the embodiments above, the VSM feature vector of the present invention can also use other combinations. For example, any seven of the software version control keyword, mailing list keyword, bug tracking keyword, developer list keyword, license keyword, change log keyword, task list keyword, and version control command can be selected to form the VSM feature vector.
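The possible feature combinations can be enumerated directly: eight ways to leave out one component, plus the full set of eight. The sketch below lists them; the feature names are illustrative labels, not identifiers from the patent.

```python
from itertools import combinations

# Illustrative labels for the eight components.
FEATURES = [
    "version_control_keyword", "mailing_list_keyword", "bug_tracking_keyword",
    "developer_list_keyword", "license_keyword", "change_log_keyword",
    "task_list_keyword", "version_control_command",
]


def candidate_feature_sets(features):
    """Return every subset that omits exactly one feature, plus the full set."""
    subsets = [list(c) for c in combinations(features, len(features) - 1)]
    subsets.append(list(features))
    return subsets


sets = candidate_feature_sets(FEATURES)
print(len(sets))  # 9
```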
The VSM classifier can be chosen from classifiers such as SVM, Bayes, linear classification, decision tree, and k-NN. Of these, SVM and decision tree are suited to binary classification. Since the embodiments above are binary classification problems, the classifiers of choice are SVM and decision tree.
Some actual test data for the present invention are given below.
Test sample set: the test sample set is constructed in the same way as the classifier's training sample set. A person of ordinary skill in the art judges, according to the OSSP page definition, whether each of 100 typical OSS-related Web pages is an OSSP page. The result is a sample set whose attributes are whether the page contains a software version control keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, and a task list keyword, and whose class is whether the page is an OSSP page.
Experimental conditions:
Hardware configuration: SONY NW series (CPU: dual-core 2.1 GHz, memory: 4 GB)
Software configuration: the operating system is Windows 7, the build and run environment is Eclipse Java EE IDE for Web Developers, and the database is MySQL 5.0.89.
The definition of accuracy is shown in Table 2:
Table 2
Notes:
TP: the number of positive examples correctly classified as positive (true positives)
FN: the number of positive examples incorrectly classified as negative (false negatives)
FP: the number of negative examples incorrectly classified as positive (false positives)
TN: the number of negative examples correctly classified as negative (true negatives)
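Since the body of Table 2 is not reproduced here, the sketch below uses the standard textbook formulas for accuracy, precision, and recall computed from these four counts. This is an assumption about what the table defines, and the counts in the example are made up for illustration; they are not the patent's test results.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from confusion counts (assumed, not taken from Table 2)."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for a 100-page sample; not the patent's data.
m = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(m["accuracy"])  # 0.85
print(m["recall"])    # 0.8
```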
The classification results produced by different classifiers using the method of the present invention are shown in Table 3:
Table 3
As can be seen from Table 3, whether an SVM classifier or a decision tree classifier is used, the present invention is a significant improvement over the traditional text classification method based on machine learning.
Embodiment 2
Present embodiments provide a kind of automatic intelligent OSS Resource Access methods, this method is from open source software alliance
Webpage sets out, by migration in the various links on the page, learns the feature of the page and link, automatically, efficiently recognizes OSSP
The page, most OSS Resource Access is come out at last, and local data base is arrived in storage.In the present embodiment, described OSS resources can be OSS
Information, OSS information include dbase, exploitation community entry address, development teams entry address, mail tabulation entry address,
Bug list entries address, code release control system entry address.
As shown in Fig. 2, the OSS information extraction method of this embodiment comprises the following steps:
Step 1: enter the URLs of the major open source software alliances and store them in the link buffer queue (the seed link list).
Step 2: automatically read an unread link from the queue, analyze the web page the link points to, and traverse the links in that page according to their link types. For each web page encountered along the traversal path, judge whether it is an OSSP page and capture the pages identified as OSSP pages, eventually forming an OSSP page set; the link buffer queue is updated at the same time. The OSSP page identification method is the same as in Embodiment 1 and is not repeated here. When a new OSSP page is identified, it can be added to the OSSP learning sample set so that the classifier is trained continually.
Step 3: automatically analyze each OSSP page in the OSSP page set, identify the OSS-related attributes, and extract the OSS information corresponding to each OSSP page. The OSS information includes the software name, the development community entry address, the development team entry address, the mailing list entry address, the bug list entry address, and the code version control system entry address.
Step 4: store the extracted OSS information in a database table whose fields are <software name, development community entry address, development team entry address, mailing list entry address, bug list entry address, code version control system entry address>.
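Steps 1 to 4 can be sketched as a queue-driven traversal followed by a database insert. This is a minimal sketch under stated assumptions, not the patented implementation: the function names, the column names, the toy link graph, and the stand-in classifier and extractor are all hypothetical, and a real crawler would fetch pages over HTTP and apply the classifier of Embodiment 1.

```python
import sqlite3
from collections import deque

# Hypothetical column names for the six fields listed in step 4.
FIELDS = ("software_name", "community_url", "team_url",
          "mailing_list_url", "bug_list_url", "vcs_url")


def crawl(seeds, fetch_links, is_ossp, extract_info, max_pages=100):
    """Breadth-first traversal from the seed links (steps 1 to 3)."""
    queue = deque(seeds)          # link buffer queue
    seen = set(seeds)             # links already queued
    records = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        if is_ossp(url):          # classifier decision for this page
            records.append(extract_info(url))
        for link in fetch_links(url):
            if link not in seen:  # update the link buffer queue
                seen.add(link)
                queue.append(link)
    return records


def store(conn, records):
    """Write extracted OSS information to a database table (step 4)."""
    columns = ", ".join(f"{name} TEXT" for name in FIELDS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS oss_info ({columns})")
    conn.executemany(
        "INSERT INTO oss_info VALUES (?, ?, ?, ?, ?, ?)",
        [tuple(record[name] for name in FIELDS) for record in records],
    )
    conn.commit()


# Toy link graph standing in for real web pages (illustrative only).
TOY_WEB = {
    "alliance": ["proj-a", "about"],
    "proj-a": ["proj-a/bugs", "alliance"],
    "about": [],
    "proj-a/bugs": [],
}

records = crawl(
    seeds=["alliance"],
    fetch_links=lambda url: TOY_WEB.get(url, []),
    is_ossp=lambda url: url == "proj-a",   # stand-in for the classifier
    extract_info=lambda url: {name: f"{url}:{name}" for name in FIELDS},
)
conn = sqlite3.connect(":memory:")
store(conn, records)
row_count = conn.execute("SELECT COUNT(*) FROM oss_info").fetchone()[0]
```

Passing the classifier and the extractor in as functions keeps the traversal independent of how pages are fetched or classified, so the toy lambdas can be swapped for real components.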
Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope.
Claims (12)
1. A Web-oriented VSM classifier training method, comprising:
training a VSM classifier with an original training set, based on an open source software homepage identification feature vector; wherein the open source software homepage identification feature vector is a VSM classifier feature vector whose components are 7 of, or all 8 of, the following: a software version control management keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, a task list keyword, and a software control management command.
2. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software version control management keyword includes SVN, Git or CVS.
3. The Web-oriented VSM classifier training method according to claim 1, characterized in that the mailing list keyword includes Mailing Lists, Mail_List or Email_List.
4. The Web-oriented VSM classifier training method according to claim 1, characterized in that the bug tracking keyword includes Bug Trackers, Issue Tracker or Bug Report.
5. The Web-oriented VSM classifier training method according to claim 1, characterized in that the developer list keyword includes Developer, Developer List, Member List, Project Memberlist, Blogger List, View Members or Author.
6. The Web-oriented VSM classifier training method according to claim 1, characterized in that the license keyword includes License, GPL, Apache License, BSD License, MIT License, Mozilla Public License, Common Development and Distribution License or Eclipse Public License.
7. The Web-oriented VSM classifier training method according to claim 1, characterized in that the change log keyword includes Change Log, Commit Log or Update Log.
8. The Web-oriented VSM classifier training method according to claim 1, characterized in that the task list keyword includes Task Lists.
9. The Web-oriented VSM classifier training method according to claim 1, characterized in that the software control management command includes the command for the initial download, the command for downloading the latest updates from the server, the command for checking out a particular revision, the command for adding a tracked file, the command for deleting a tracked file, or the command for committing changes.
10. A Web-oriented open source software homepage identification method, characterized by comprising the following steps:
1) for each Web page to be identified, extracting the open source software homepage identification feature vector of that Web page, wherein the open source software homepage identification feature vector is a VSM classifier feature vector whose components are 7 of, or all 8 of, the following: a software version control management keyword, a mailing list keyword, a bug tracking keyword, a developer list keyword, a license keyword, a change log keyword, a task list keyword, and a software control management command;
2) then using the VSM classifier trained by the VSM classifier training method according to any one of claims 1 to 9 to identify whether the Web page is an open source software homepage.
11. A Web-oriented open source software resource acquisition method, characterized by comprising the following steps:
1) identifying open source software homepages among Web pages by the Web-oriented open source software homepage identification method according to claim 10;
2) searching for open source software resources in the identified open source software homepages.
12. The Web-oriented open source software resource acquisition method according to claim 11, characterized by further comprising the following step:
3) downloading the open source software resources found in step 2) to a local store.
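The eight-component feature vector of claims 1 to 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed method itself: the binary presence weighting, the centroid-plus-cosine-similarity decision rule, the threshold, and the concrete command strings in the eighth category are all assumptions introduced here. The keyword lists for the first seven categories are taken from claims 2 to 8.

```python
import math
import re

KEYWORDS = {
    "version_control": ["SVN", "Git", "CVS"],
    "mailing_list": ["Mailing Lists", "Mail_List", "Email_List"],
    "bug_tracking": ["Bug Trackers", "Issue Tracker", "Bug Report"],
    "developer_list": ["Developer", "Developer List", "Member List",
                       "Project Memberlist", "Blogger List",
                       "View Members", "Author"],
    "license": ["License", "GPL", "Apache License", "BSD License",
                "MIT License", "Mozilla Public License",
                "Common Development and Distribution License",
                "Eclipse Public License"],
    "change_log": ["Change Log", "Commit Log", "Update Log"],
    "task_list": ["Task Lists"],
    # Assumption: claim 9 describes the commands but does not name them.
    "vcs_command": ["svn checkout", "svn update", "git clone", "cvs checkout"],
}


def feature_vector(text):
    """One component per category: 1.0 if any keyword occurs, else 0.0."""
    vector = []
    for keywords in KEYWORDS.values():
        hit = any(
            re.search(r"\b" + re.escape(kw) + r"\b", text, re.IGNORECASE)
            for kw in keywords
        )
        vector.append(1.0 if hit else 0.0)
    return vector


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroid(positive_pages):
    """Average the feature vectors of known positive example pages."""
    vectors = [feature_vector(page) for page in positive_pages]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def is_ossp(text, centroid, threshold=0.5):
    """Classify a page by its similarity to the positive centroid."""
    return cosine(feature_vector(text), centroid) >= threshold


positive = ("Download via SVN. Mailing Lists | Issue Tracker | Developer List "
            "| GPL License | Change Log | Task Lists | "
            "svn checkout http://example.org/repo")
negative = "Today's weather: sunny with light wind."
centroid = train_centroid([positive])
```

The sketch uses all 8 components; the 7-component variant allowed by claim 1 corresponds to dropping one entry from `KEYWORDS`.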
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010609743.0A CN102541913B (en) | 2010-12-15 | 2010-12-15 | VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102541913A (en) | 2012-07-04 |
CN102541913B (en) | 2017-10-03 |
Family
ID=46348830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010609743.0A Expired - Fee Related CN102541913B (en) | 2010-12-15 | 2010-12-15 | VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102541913B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103078897B (en) * | 2012-11-29 | 2015-11-18 | 中山大学 | A kind of system realizing Web service fine grit classification and management |
CN103226509B (en) * | 2013-04-08 | 2016-03-30 | 上海华力微电子有限公司 | A kind of method of system journal automatic analysis |
CN110188536B (en) * | 2019-05-22 | 2021-04-20 | 北京邮电大学 | Application program detection method and device |
CN110990035B (en) * | 2019-11-01 | 2023-03-14 | 中国人民解放军63811部队 | Chain type software upgrading method based on Git |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1719436A (en) * | 2004-07-09 | 2006-01-11 | 中国科学院自动化研究所 | A kind of method and device of new proper vector weight towards text classification |
CN101055621A (en) * | 2006-04-10 | 2007-10-17 | 中国科学院自动化研究所 | Content based sensitive web page identification method |
CN101281521A (en) * | 2007-04-05 | 2008-10-08 | 中国科学院自动化研究所 | Method and system for filtering sensitive web page based on multiple classifier amalgamation |
Non-Patent Citations (1)
Title |
---|
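A survey of data mining for software development repositories (面向软件开发信息库的数据挖掘综述); Bai Jie, Li Chunping; Application Research of Computers (《计算机应用研究》); 20080131; Vol. 25, No. 1; Sections 1 and 3.1.1 * |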
Also Published As
Publication number | Publication date |
---|---|
CN102541913A (en) | 2012-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171003 Termination date: 20201215 |