CN103902703A - Text content classification method based on mobile Internet access - Google Patents
Text content classification method based on mobile Internet access
- Publication number
- CN103902703A
- Authority
- CN
- China
- Prior art keywords
- url
- knowledge
- reasoning
- feature
- page
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
Index | "Full URL" cleaning rule in the hash list | Category | Confidence |
0 | Entry=222.186.14.3/ | Search engine | 5.78% |
1 | Entry=mob.3g.cn/sorry/404/error.html | Error | 4.96% |
2 | Entry=222.186.14.5/ | Search engine | 4.52% |
3 | Entry=mob.3g.cn/sorry/404/404.wml | Error | 3.89% |
4 | Entry=www.umeng.com/check_config_update | Software update | 3.57% |
…… |
Index | "First-level domain" cleaning rule in the hash list | Confidence |
0 | Entry=qq.com | 9.25% |
1 | Entry=cnzz.net | 8.36% |
2 | Entry=***.com | 7.25% |
3 | Entry=taobao.com | 4.37% |
4 | Entry=qlogo.cn | 3.58% |
…… |
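The two cleaning tables above can be read as hash lookups keyed by a normalized URL or by a domain. The following is a minimal Python sketch of that reading, not the patented procedure: the names `FULL_URL_CLEANING`, `DOMAIN_CLEANING`, and `match_cleaning_rule`, the normalization step, and the order of the two lookups are all assumptions made for illustration. Only the rule keys, the category labels (translated), and the percentages come from the tables, and this excerpt does not say how the confidence values are computed or used.

```python
from urllib.parse import urlsplit

# Sample rules copied from the tables above (confidence stored as a fraction).
# The dictionary names and layout are hypothetical.
FULL_URL_CLEANING = {
    "222.186.14.3/": ("Search engine", 0.0578),
    "mob.3g.cn/sorry/404/error.html": ("Error", 0.0496),
    "www.umeng.com/check_config_update": ("Software update", 0.0357),
}
DOMAIN_CLEANING = {"qq.com": 0.0925, "cnzz.net": 0.0836, "taobao.com": 0.0437}


def normalize(url):
    """Return (host, scheme-less URL key) for a raw URL string."""
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    rest = parts.path + ("?" + parts.query if parts.query else "")
    return host, host + rest


def domain_suffixes(host):
    """List the host and each parent domain, excluding the bare final label."""
    labels = host.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


def match_cleaning_rule(url):
    """Return (rule_type, matched_key) if a cleaning rule applies, else None."""
    host, key = normalize(url)
    if key in FULL_URL_CLEANING:          # assumed first: exact full-URL match
        return ("full_url", key)
    for suffix in domain_suffixes(host):  # assumed second: domain match
        if suffix in DOMAIN_CLEANING:
            return ("domain", suffix)
    return None
```

A record whose URL matches either table would then be dropped before classification. Trying every suffix of the host avoids having to decide where the registered domain begins, which matters for multi-label suffixes such as `com.cn`.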
Index | "Full URL" content classification rule in the hash list | Category | Confidence |
0 | launcher.warcraftchina.com/2.0/?locale=zh-CN | Online games | 3.15% |
1 | www.222tk.com/ | Lottery | 2.87% |
2 | street.yoka.com/clockbeauty/ | Fashion | 2.45% |
3 | 3g.eastmoney.com/Money.aspx | Finance | 1.67% |
4 | house.lsfc.net.cn/sell_info.asp?id=1097356 | Real estate | 1.54% |
…… |
Index | "First-level domain" content classification rule in the hash list | Confidence |
0 | Entry=sina.com.cn | 4.32% |
1 | Entry=sohu.com | 3.98% |
2 | Entry=ifeng.com | 3.45% |
3 | Entry=sina.cn | 2.65% |
4 | Entry=qidian.cn | 2.14% |
…… |
Index | "Full domain name" content classification rule in the hash list | Category | Confidence |
0 | Entry=Sports.sina.com.cn | Sports | 2.25% |
1 | Entry=news.tianya.cn | Forum | 2.04% |
2 | Entry=news.sohu.com | News | 1.85% |
…… |
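The three classification tables above suggest a lookup that goes from the most specific key to the least specific one. The Python sketch below illustrates that idea under stated assumptions: the precedence (full URL, then full domain name, then first-level domain) is a guess, not something this excerpt confirms, and `classify` and the three container names are hypothetical. The first-level domain table lists no category column, so the sketch only reports that such a domain is known.

```python
from urllib.parse import urlsplit

# Sample rules copied from the tables above; category labels are translated.
FULL_URL_RULES = {
    "www.222tk.com/": "Lottery",
    "3g.eastmoney.com/Money.aspx": "Finance",
    "launcher.warcraftchina.com/2.0/?locale=zh-CN": "Online games",
}
# Host names are case-insensitive, so "Sports.sina.com.cn" is stored lowercased.
FULL_DOMAIN_RULES = {
    "sports.sina.com.cn": "Sports",
    "news.tianya.cn": "Forum",
    "news.sohu.com": "News",
}
KNOWN_FIRST_LEVEL_DOMAINS = {"sina.com.cn", "sohu.com", "ifeng.com", "sina.cn", "qidian.cn"}


def classify(url):
    """Return (category or None, name of the rule table that matched)."""
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    key = host + parts.path + ("?" + parts.query if parts.query else "")

    if key in FULL_URL_RULES:        # most specific: exact URL
        return FULL_URL_RULES[key], "full_url"
    if host in FULL_DOMAIN_RULES:    # next: exact host name
        return FULL_DOMAIN_RULES[host], "full_domain"
    labels = host.split(".")
    for i in range(len(labels) - 1):  # last: any parent domain of the host
        if ".".join(labels[i:]) in KNOWN_FIRST_LEVEL_DOMAINS:
            return None, "first_level_domain"
    return None, "unmatched"
```

For example, `classify("Sports.sina.com.cn/nba/")` resolves through the full domain table, while a page on `blog.sina.com.cn` only reaches the first-level domain table and would need further analysis to receive a category.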
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410126495.2A CN103902703B (zh) | 2014-03-31 | 2014-03-31 | Text content classification method based on mobile Internet access |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410126495.2A CN103902703B (zh) | 2014-03-31 | 2014-03-31 | Text content classification method based on mobile Internet access |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103902703A (zh) | 2014-07-02 |
CN103902703B (zh) | 2016-02-10 |
Family
ID=50994025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410126495.2A Active CN103902703B (zh) | Text content classification method based on mobile Internet access | 2014-03-31 | 2014-03-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103902703B (zh) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
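CN105117436A (zh) * | 2015-08-10 | 2015-12-02 | 上海晶赞科技发展有限公司 | Automatic website channel mining method |
CN105528351A (zh) * | 2014-09-29 | 2016-04-27 | 中国电信股份有限公司 | Content deduplication method and *** for Internet information obtained by a mobile terminal |
CN105930444A (zh) * | 2016-04-20 | 2016-09-07 | 广州精点计算机科技有限公司 | Internet user grouping method and *** |
CN105956002A (zh) * | 2016-04-20 | 2016-09-21 | 广州精点计算机科技有限公司 | Web page classification method and device based on URL analysis |
CN106161352A (zh) * | 2015-03-31 | 2016-11-23 | 阿里巴巴集团控股有限公司 | Matching method, client, server, and matching device |
CN106294861A (zh) * | 2016-08-23 | 2017-01-04 | 武汉烽火普天信息技术有限公司 | Text aggregation and presentation method and *** in an intelligence *** for large-scale data |
CN109241274A (zh) * | 2017-07-04 | 2019-01-18 | 腾讯科技(深圳)有限公司 | Text clustering method and device |
CN109739849A (zh) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | Data-driven platform for mining and early warning of sensitive network information |
CN110008340A (zh) * | 2019-03-27 | 2019-07-12 | 曲阜师范大学 | Multi-source text knowledge representation, acquisition, and fusion *** |
CN110460592A (zh) * | 2019-07-26 | 2019-11-15 | 杭州吉讯汇通科技有限公司 | URL analysis method, apparatus, device, and medium |
CN111258969A (zh) * | 2018-11-30 | 2020-06-09 | ***通信集团浙江有限公司 | Internet access log parsing method and device |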
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103838886A (zh) * | 2014-03-31 | 2014-06-04 | 辽宁四维科技发展有限公司 | Text content classification method based on a representative-word knowledge base |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080270384A1 (en) * | 2007-04-28 | 2008-10-30 | Raymond Lee Shu Tak | System and method for intelligent ontology based knowledge search engine |
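CN101593200A (zh) * | 2009-06-19 | 2009-12-02 | 淮海工学院 | Chinese web page classification method based on keyword frequency analysis |
CN103136372A (zh) * | 2013-03-21 | 2013-06-05 | 陕西通信信息技术有限公司 | Method for fast URL locating, classification, and filtering in network trustworthiness behavior management |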
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
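CN105528351A (zh) * | 2014-09-29 | 2016-04-27 | 中国电信股份有限公司 | Content deduplication method and *** for Internet information obtained by a mobile terminal |
CN106161352A (zh) * | 2015-03-31 | 2016-11-23 | 阿里巴巴集团控股有限公司 | Matching method, client, server, and matching device |
CN105117436B (zh) * | 2015-08-10 | 2018-03-30 | 上海晶赞科技发展有限公司 | Automatic website channel mining method |
CN105117436A (zh) * | 2015-08-10 | 2015-12-02 | 上海晶赞科技发展有限公司 | Automatic website channel mining method |
CN105930444A (zh) * | 2016-04-20 | 2016-09-07 | 广州精点计算机科技有限公司 | Internet user grouping method and *** |
CN105956002A (zh) * | 2016-04-20 | 2016-09-21 | 广州精点计算机科技有限公司 | Web page classification method and device based on URL analysis |
CN106294861B (zh) * | 2016-08-23 | 2019-08-09 | 武汉烽火普天信息技术有限公司 | Text aggregation and presentation method and *** in an intelligence *** for large-scale data |
CN106294861A (zh) * | 2016-08-23 | 2017-01-04 | 武汉烽火普天信息技术有限公司 | Text aggregation and presentation method and *** in an intelligence *** for large-scale data |
CN109241274A (zh) * | 2017-07-04 | 2019-01-18 | 腾讯科技(深圳)有限公司 | Text clustering method and device |
CN109241274B (zh) * | 2017-07-04 | 2022-01-25 | 腾讯科技(深圳)有限公司 | Text clustering method and device |
CN111258969A (zh) * | 2018-11-30 | 2020-06-09 | ***通信集团浙江有限公司 | Internet access log parsing method and device |
CN111258969B (zh) * | 2018-11-30 | 2023-08-15 | ***通信集团浙江有限公司 | Internet access log parsing method and device |
CN109739849B (zh) * | 2019-01-02 | 2021-06-29 | 山东省科学院情报研究所 | Data-driven platform for mining and early warning of sensitive network information |
CN109739849A (zh) * | 2019-01-02 | 2019-05-10 | 山东省科学院情报研究所 | Data-driven platform for mining and early warning of sensitive network information |
CN110008340A (zh) * | 2019-03-27 | 2019-07-12 | 曲阜师范大学 | Multi-source text knowledge representation, acquisition, and fusion *** |
CN110460592A (zh) * | 2019-07-26 | 2019-11-15 | 杭州吉讯汇通科技有限公司 | URL analysis method, apparatus, device, and medium |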
Also Published As
Publication number | Publication date |
---|---|
CN103902703B (zh) | 2016-02-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
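CN103902703B (zh) | Text content classification method based on mobile Internet access | |
CN103218431B (zh) | *** capable of identifying automated collection of web page information | |
CN102831199B (zh) | Method and device for building an interest model | |
CN112749284B (zh) | Knowledge graph construction method, apparatus, device, and storage medium | |
CN107862022B (zh) | Cultural resource recommendation *** | |
CN104077402B (zh) | Data processing method and data processing *** | |
CN107220386A (zh) | Information push method and device | |
CN103838886A (zh) | Text content classification method based on a representative-word knowledge base | |
CN113254630B (zh) | Domain knowledge graph recommendation method for global comprehensive observation results | |
CN110912888B (zh) | Deep-learning-based malicious HTTP traffic detection *** and method | |
CN109783619A (zh) | Data filtering and mining method | |
CN104391978A (zh) | Web page bookmark processing method and device for a browser | |
CN108984514A (zh) | Word acquisition method and device, storage medium, and processor | |
WO2022076885A1 (en) | Systems and methods for tracking data shared with third parties using artificial intelligence-machine learning | |
CN103914534B (zh) | Text content classification method based on an expert *** URL classification knowledge base | |
CN114371946B (zh) | Information push method and information push server based on cloud computing and big data | |
CN107086925A (zh) | Deep-learning-based big data analysis method for Internet traffic | |
CN113378024A (zh) | Deep-learning-based related event identification method for the public security, procuratorate, and court domain | |
CN116226494B (zh) | Crawler *** and method for information search | |
CN112269906A (zh) | Method and device for automatic extraction of web page body text | |
CN103902707B (zh) | "Junk" content filtering method using an expert *** URL cleaning knowledge base | |
CN108920492B (zh) | Web page classification method, ***, terminal, and storage medium | |
CN100357942C (zh) | Search method for a mobile Internet intelligent information search engine | |
CN105930328A (zh) | Abnormal data parsing method and *** | |
CN114912538A (zh) | Information push model training method, information push method, apparatus, and device |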
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right | Effective date of registration: 20151228; Address after: 110020 Shenyang, Liaoning, Tiexi District, No. nine small road 12 3-7-1; Applicant after: Guo Lei; Address before: 110043, Dadong Road, Dadong District, Liaoning, 134, two gate, two floor, Shenyang; Applicant before: LIAONING SIWEI SCIENCE AND TECHNOLOGY DEVELOPMENT CO., Ltd. |
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20200110; Address after: 100088 B601, floor 1, building 5, yard 13, Huayuan Road, Haidian District, Beijing; Patentee after: Beijing Dongfang Yixin Technology Co.,Ltd.; Address before: 110020, No. 12, No. nine, Tiexi Road, Shenyang District, Liaoning, 3-7-1; Patentee before: Guo Lei |
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right | Effective date of registration: 20210928; Address after: 1530, Lin 10, No. 84, Wenquan Road, Wenquan Town, Haidian District, Beijing 100095; Patentee after: Beijing yunqi lechuang Technology Co.,Ltd.; Address before: 100088 B601, North 1st floor, building 5, yard 13, Huayuan Road, Haidian District, Beijing; Patentee before: Beijing Dongfang Yixin Technology Co.,Ltd. |
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right | Effective date of registration: 20220224; Address after: 101100 room 252, floor 2, building 7, courtyard 15, Tonghu street, Tongzhou District, Beijing; Patentee after: Beijing Zhongding Yixin Technology Co.,Ltd.; Address before: 1530, Lin 10, No. 84, Wenquan Road, Wenquan Town, Haidian District, Beijing 100095; Patentee before: Beijing yunqi lechuang Technology Co.,Ltd. |
TR01 | Transfer of patent right | ||
PP01 | Preservation of patent right | Effective date of registration: 20221028; Granted publication date: 20160210 |
PP01 | Preservation of patent right |