jieba offers two ways to extract keywords from a string, distinguished by algorithm: TF-IDF and TextRank.
jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())
The extracted keywords are returned in descending order of TF-IDF weight.
>>> import jieba.analyse
>>> s = "我出門去買早餐"
>>> print(jieba.analyse.extract_tags(s, topK=20, withWeight=False, allowPOS=()))
['出門', '早餐']
>>> for x, w in jieba.analyse.extract_tags(s, withWeight=True):
...     print('%s %s' % (x, w))
...
出門 5.97738375145
早餐 4.29868637196
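To make the ranking above more concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is not jieba's implementation: the function name tfidf_rank, the input word list, and every IDF value are made up for illustration, and the input is assumed to be segmented already.

```python
from collections import Counter

def tfidf_rank(words, idf, default_idf, top_k=20):
    """Rank words by (term frequency) x (inverse document frequency)."""
    counts = Counter(words)
    total = sum(counts.values())
    # Weight = share of the text taken by the word, scaled by its IDF.
    # Words missing from the IDF table fall back to default_idf.
    weights = {w: c / total * idf.get(w, default_idf) for w, c in counts.items()}
    # Highest weight first, keep at most top_k entries.
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical input: already-segmented words and invented IDF values.
words = ["出門", "早餐", "早餐", "咖啡"]
idf = {"出門": 12.0, "早餐": 8.0}
ranked = tfidf_rank(words, idf, default_idf=10.0)
for word, weight in ranked:
    print(word, weight)
```

A word scores high when it is frequent in the text but rare in general, which is why a common word with a low IDF can still rank below a rarer one.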
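jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns','n','vn','v')) can be called directly and has the same interface. Note that, unlike extract_tags, it filters by part of speech by default.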
>>> print(jieba.analyse.textrank(s, withWeight=False))
['早餐', '出門']
>>> for x, w in jieba.analyse.textrank(s, withWeight=True):
...     print('%s %s' % (x, w))
...
早餐 1.0
出門 0.9961264494011037
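The general idea behind TextRank can be sketched as follows: link words that appear near each other, then let each word's score flow to its neighbours, PageRank style. This is an illustrative sketch only, not jieba's code. The function name textrank_sketch, the window size, the damping factor, the iteration count, and the sample word list are all assumptions, and the scores are not normalized the way jieba's output is.

```python
from collections import defaultdict

def textrank_sketch(words, window=5, damping=0.85, iterations=10):
    """Score words on an undirected co-occurrence graph, PageRank style."""
    # Link every pair of distinct words that occur within `window` positions;
    # the edge weight counts how often the pair co-occurs.
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                graph[w][words[j]] += 1.0
                graph[words[j]][w] += 1.0
    graph = {w: dict(nbrs) for w, nbrs in graph.items()}
    if not graph:
        return []
    score = {w: 1.0 / len(graph) for w in graph}
    out_sum = {w: sum(nbrs.values()) for w, nbrs in graph.items()}
    for _ in range(iterations):
        new_score = {}
        for w in graph:
            # Each neighbour passes on a share of its score, in proportion
            # to the weight of the edge that leads to w.
            incoming = sum(graph[v][w] / out_sum[v] * score[v] for v in graph[w])
            new_score[w] = (1 - damping) + damping * incoming
        score = new_score
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical segmented input; window=2 links adjacent words only.
words = ["早餐", "咖啡", "早餐", "出門", "早餐", "牛奶"]
ranked = textrank_sketch(words, window=2)
for word, weight in ranked:
    print(word, weight)
```

In this sample, 早餐 is adjacent to every other word, so it collects score from all of them and ends up ranked first.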
To see which part of speech jieba assigns to each word, use jieba.posseg:
>>> import jieba.posseg as pseg
>>> words = pseg.cut(s)
>>> for word, flag in words:
...     print('%s %s' % (word, flag))
...
我 r
出門 v
去 v
買 v
早餐 n
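The tags above help explain why textrank returns only two keywords for this sentence. The sketch below applies an allowPOS-style filter to those (word, flag) pairs. The helper name filter_by_pos is hypothetical, and the minimum length of two characters is an assumption chosen to reproduce the result shown earlier; jieba also applies a stop-word list, which is left out here.

```python
def filter_by_pos(tagged, allow_pos=("ns", "n", "vn", "v"), min_len=2):
    """Keep words whose tag is allowed and that are at least min_len long."""
    return [
        word
        for word, flag in tagged
        if flag in allow_pos and len(word.strip()) >= min_len
    ]

# The (word, flag) pairs printed by pseg.cut for the sample sentence.
tagged = [("我", "r"), ("出門", "v"), ("去", "v"), ("買", "v"), ("早餐", "n")]
candidates = filter_by_pos(tagged)
print(candidates)  # ['出門', '早餐']
```

The pronoun 我 is dropped by the tag filter, while 去 and 買 carry an allowed tag but are removed by the length check.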
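jieba.tokenize(u'') reports each word together with its start and end offsets. The input must be a unicode string (hence the u prefix, which only matters on Python 2), and the result is a generator, so read it with a for ... in loop.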
>>> result = jieba.tokenize(u'他在游泳池唱國歌')
>>> for tk in result:
...     print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
word 他 start: 0 end:1
word 在 start: 1 end:2
word 游泳池 start: 2 end:5
word 唱國歌 start: 5 end:8
>>> result = jieba.tokenize(u'他在游泳池唱國歌',mode='search')
>>> for tk in result:
...     print("word %s\t\t start: %d \t\t end:%d" % (tk[0],tk[1],tk[2]))
word 他 start: 0 end:1
word 在 start: 1 end:2
word 游泳 start: 2 end:4
word 泳池 start: 3 end:5
word 游泳池 start: 2 end:5
word 唱國歌 start: 5 end:8
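The difference between the two modes can be sketched in plain Python. In the sketch below the segmentation is supplied by hand, and search mode additionally emits every 2-character and 3-character piece of a longer word that appears in a dictionary. The function name tokenize_sketch and the two-entry dictionary are assumptions for illustration; jieba consults its own dictionary instead.

```python
def tokenize_sketch(words, dictionary=frozenset(), mode="default"):
    """Yield (word, start, end) for each segmented word, tracking offsets."""
    start = 0
    for w in words:
        if mode == "search":
            # Emit shorter dictionary words found inside a longer word:
            # 2-grams for words longer than 2, 3-grams for words longer than 3.
            for n in (2, 3):
                if len(w) > n:
                    for i in range(len(w) - n + 1):
                        gram = w[i:i + n]
                        if gram in dictionary:
                            yield (gram, start + i, start + i + n)
        yield (w, start, start + len(w))
        start += len(w)

# Segmentation taken from the output above; the dictionary is hypothetical.
words = ["他", "在", "游泳池", "唱國歌"]
dictionary = {"游泳", "泳池"}
default_tokens = list(tokenize_sketch(words))
search_tokens = list(tokenize_sketch(words, dictionary, mode="search"))
for tk in search_tokens:
    print("word %s\t\t start: %d \t\t end:%d" % (tk[0], tk[1], tk[2]))
```

Because sub-words are emitted before the word that contains them, the offsets in search mode overlap, as with 游泳 (2 to 4) and 泳池 (3 to 5) above.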
Reference: https://github.com/fxsjy/jieba