Automatically Classifying Content in Python with AI Techniques

xianbin

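For a while in the second half of 2017 my work involved AI, so I spent a short time studying it. This post records only preliminary results and is meant purely as a starting point for discussion.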

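An earlier post introduced web crawlers. Crawling is in fact closely tied to AI: before a model can analyze anything it has to be trained on data, and crawling lets you collect that training data from the web, which can improve both the efficiency and the accuracy of the AI processing.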

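First, the business requirement: suppose the questions users ask about illnesses need to be classified automatically by department, for example respiratory medicine, cardiology, and gastroenterology, and grouped accordingly.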

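The steps are:
1. Crawl already-categorized questions from some medical websites
2. Train a classifier on those questions
3. Feed in questions about one kind of illness and check how well they are recognized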

I. Data crawling
This example uses data from the “问医生” site (https://www.jiankang.com). Each question is crawled into its own file.
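
scikit-learn's `load_files`, which the script in the next section uses, expects one subfolder per category with one file per document. The crawler itself is not shown in this post, so the following is only a minimal sketch of writing crawled questions into that layout. The helper names `save_question` and `save_all`, the file naming scheme, and the sample records are assumptions made for illustration.

```python
from pathlib import Path


def save_question(root, category, index, text):
    """Write one question to <root>/<category>/<index>.txt and return its path."""
    folder = Path(root) / category
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{index:05d}.txt"
    path.write_text(text, encoding="utf-8")
    return path


def save_all(root, records):
    """Save (category, question_text) pairs, numbering files per category."""
    counters = {}
    paths = []
    for category, text in records:
        counters[category] = counters.get(category, 0) + 1
        paths.append(save_question(root, category, counters[category], text))
    return paths


# Made-up records standing in for crawled (category, question) pairs.
sample_records = [
    ("呼吸内科", "最近总是咳嗽，晚上更严重，怎么办？"),
    ("心内科", "运动后心跳很快，胸口发闷，需要检查吗？"),
]
```

Calling `save_all('category/train', sample_records)` would produce `category/train/呼吸内科/00001.txt` and `category/train/心内科/00001.txt`, which matches the `train_path` used by the script below.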

II. Data processing code
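
The script below calls `ner()` from the `nerutils` module, which is not included in this post. From the way it is used (`' '.join(ner(text))`), it evidently turns a Chinese string into a sequence of tokens, which is necessary because Chinese text has no spaces for the vectorizer to split on. As a rough stand-in, and not the author's implementation, the sketch below splits runs of Chinese characters into overlapping two-character tokens. The function name `bigram_tokens` and the bigram approach are assumptions.

```python
import re

# Runs of CJK ideographs, or runs of ASCII letters/digits.
_RUN = re.compile(r'[\u4e00-\u9fff]+|[A-Za-z0-9]+')


def bigram_tokens(text):
    """Return overlapping character bigrams for Chinese runs, whole words otherwise."""
    tokens = []
    for run in _RUN.findall(text):
        if '\u4e00' <= run[0] <= '\u9fff':
            if len(run) == 1:
                # A lone character has no bigram; keep it as-is.
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        else:
            # ASCII words (e.g. "CT") are kept whole and lower-cased.
            tokens.append(run.lower())
    return tokens
```

`' '.join(bigram_tokens(text))` yields a whitespace-separated string in the shape the script expects. Keep in mind that `CountVectorizer`'s default token pattern ignores single-character tokens, so lone characters would be dropped at the vectorizing step.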

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# ner() comes from nerutils, the author's own helper module (not shown in this post)
from nerutils import *
from sklearn.linear_model import SGDClassifier

# Text categories to include in the analysis
categories = ['呼吸内科', '心内科', '消化内科']

train_path='category/train'

# Load the raw data from disk
twenty_train=load_files(train_path,
        categories=categories,
        load_content = True,
        encoding='utf-8',
        decode_error='strict',
        shuffle=True, random_state=42)
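# Preprocess each document with ner() and rejoin the results with spaces,
# so that the vectorizer in the pipeline below can split on whitespace
for index in range(len(twenty_train.data)):
    twenty_train.data[index] = ' '.join(ner(twenty_train.data[index]))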

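from sklearn.pipeline import Pipeline
# Build the pipeline: word counts -> TF-IDF weights -> linear SVM trained with SGD
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge',
                                            penalty='l2',
                                            alpha=1e-3,
                                            # n_iter was removed in scikit-learn 0.21; use max_iter
                                            max_iter=5,
                                            tol=None,
                                            random_state=42)),
])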

# Train the classifier
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
# Print the fitted pipeline
print(text_clf)

# Load the test data (a single category only)
categories = ['呼吸内科']

test_path = 'category/test'

test_train=load_files(test_path,
        categories=categories,
        load_content = True,
        encoding='utf-8',
        decode_error='strict',
        shuffle=True, random_state=42)

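# Apply the same ner() preprocessing to the test documents
for index in range(len(test_train.data)):
    test_train.data[index] = ' '.join(ner(test_train.data[index]))

# Expected label for every test document: index 0, which is '呼吸内科' in the training categories
test_train.target = [0]*len(test_train.target)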

docs_test = test_train.data

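# Classify the test documents
predicted = text_clf.predict(docs_test)
print("Predicted labels: " + str(predicted))

# Compute the accuracy of the predictions
import numpy as np
print("Accuracy:")
print(np.mean(predicted == test_train.target) * 100)

Below is the output of the test run. The accuracy is 100%, which was a surprise! Note that the test set contains only one category (呼吸内科), so this figure shows only how well that single class is recognized.

Predicted labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0]
Accuracy:
100.0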
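
Since the test run above covers a single category, a per-category breakdown over all three departments would give a fuller picture. The sketch below shows one way to get that with scikit-learn's `classification_report`. It is self-contained rather than a continuation of the script: the tiny in-memory data set is made up for illustration, and character n-grams (`analyzer='char'`) are an assumption that stands in for the `ner()` preprocessing, which is not shown in this post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Made-up toy data; label i refers to target_names[i].
target_names = ['呼吸内科', '心内科', '消化内科']
train_docs = [
    "咳嗽有痰，晚上加重", "感冒后一直气喘、呼吸困难",
    "胸闷心慌，心跳很快", "血压偏高，偶尔心前区疼痛",
    "胃痛反酸，饭后腹胀", "腹泻三天，伴有恶心呕吐",
]
train_labels = [0, 0, 1, 1, 2, 2]
test_docs = ["孩子咳嗽气喘怎么办", "最近总是心慌胸闷", "饭后胃痛反酸"]
test_labels = [0, 1, 2]

# Character 1- and 2-grams avoid the need for a Chinese word segmenter.
clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])
clf.fit(train_docs, train_labels)

predicted = clf.predict(test_docs)
# Map numeric predictions back to department names.
predicted_names = [target_names[int(i)] for i in predicted]

# Precision, recall and F1 per category instead of a single accuracy figure.
report = classification_report(test_labels, predicted,
                               labels=[0, 1, 2],
                               target_names=target_names,
                               zero_division=0)
print(predicted_names)
print(report)
```

With real data, the test folder would need documents from every category for the per-category figures to be meaningful.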

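This work lasted only about a month, so I did not take it any further. Still, judging from my own industry experience, AI can be a very valuable complement in many areas. Classification alone has a wide range of business uses. Take a car market research project that has to gather vehicle information from many websites and sort it by, say, engine displacement, quality, or engine type: AI can pre-classify the information before it goes on to BI processing.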

Readers are welcome to join the discussion on other applications.