iT邦幫忙

0

Removing emoji or special characters from a CSV file with Python


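As the title says, I am currently learning web scraping.
I want to run sentiment analysis on the text file (CSV) that my scraper produces.
However, the file contains a large number of emoji, and the sentiment analysis cannot run on it.
So I am looking for a way to strip the emoji beforehand.
The code I have tried so far is below, but it still does not run correctly. Any pointers would be much appreciated!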


import re
import pandas as pd

def demoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U00010000-\U0010ffff"
        "]+", flags=re.UNICODE)
    return(emoji_pattern.sub(r'', text))

data = pd.read_csv('test.csv',encoding='utf-8', sep='\t')  # read tsv file

data = pd.read_csv('test.csv',encoding='utf-8')  # read csv file

data[u'header'] = data[u'header'].astype(str)
data[u'header'] = data[u'header'].apply(lambda x:demoji(x))
data.to_csv('output.csv',index=False, encoding='utf-8')  # save to csv file


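The error shown is KeyError: 'header'
I do not have enough experience to debug this on my own, so I would be grateful for help from anyone who has dealt with this before.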

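froce iT邦 Master, Level 1 ‧ 2022-05-25 08:46:38
KeyError: 'header' means that your data has no column named header.

You need to learn to read error messages and to search for them on Google.
Hello, I understand that the column does not exist; the code came from a web search.
My problem is that I tried changing the column name and it still did not work, which is why I am asking here.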
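A quick way to see why renaming the column does not help is to print the column names pandas actually found. The following is a minimal sketch, not the asker's real data: `csv_text` is a hypothetical stand-in for test.csv (one comment per line, no header row), read from memory so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: one comment per line, no header row.
csv_text = "Great video 😀\nNot bad at all\n"

# By default, read_csv treats the first line of the file as the header.
data = pd.read_csv(io.StringIO(csv_text))

columns = list(data.columns)
print(columns)  # ['Great video 😀']

# The name used in the script must match one of these names exactly.
has_header = "header" in data.columns
print(has_header)  # False
```

When the file has no header row, the first comment becomes the column name, so no fixed name written in the script can match it.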
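hokou iT邦 Adept, Level 1 ‧ 2022-05-25 11:43:55
More precisely, it depends on which columns your test.csv has.
Decide which columns you want to strip emoji from, then adjust the code accordingly.
Hello, my test.csv contains only the comment text, with no other columns.
In other words, I want to read this cleaned-up test.csv and remove the emoji from it. Thanks!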
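If the file really holds only comment text with no header row, one option is to tell pandas so and supply the column name yourself. This is a sketch under that assumption; `csv_text` is again a hypothetical in-memory stand-in for test.csv, and the name `header` is only chosen to match the script above.

```python
import io
import pandas as pd

# Hypothetical stand-in for test.csv: comments only, no header row.
csv_text = "Great video 😀\nNot bad at all\n"

# header=None: the first line is data, not column names.
# names=[...]: assign the column name that the rest of the script expects.
data = pd.read_csv(io.StringIO(csv_text), header=None, names=["header"])

print(list(data.columns))  # ['header']
print(len(data))           # 2
```

With a real file, the same `header=None, names=["header"]` arguments can be passed alongside the path and `encoding='utf-8'`. Comments that contain unquoted commas may still be split by the parser; that case is not handled here.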

2 Answers

2
I code so I am
iT邦 Expert, Level 1 ‧ 2022-05-26 06:37:26
Best answer

The program is basically fine; it is only missing a column name. Here is my attempt:

import re
import pandas as pd

def demoji(text):
    # Remove every character that falls inside the ranges below.
    # Caution: \U000024C2-\U0001F251 is a very wide range that also spans
    # CJK characters, so Chinese text is removed along with the emoji.
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F" # emoticons
        u"\U0001F300-\U0001F5FF" # symbols & pictographs
        u"\U0001F680-\U0001F6FF" # transport & map symbols
        u"\U0001F1E0-\U0001F1FF" # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U00010000-\U0010ffff"
        "]+", flags=re.UNICODE)
    return(emoji_pattern.sub(r'', text))

# Read the whole text file into a single string.
with open('data.txt', 'r',encoding='utf-8') as f:
    text = f.read()

# Build the DataFrame with an explicit 'header' column.
data = pd.DataFrame({'header':[text]})
#data = pd.read_csv('data.txt',encoding='utf-8')

data['header'] = data['header'].apply(lambda x:demoji(x))
data.to_csv('output.csv',index=False, encoding='utf-8')

Contents of data.txt:

In case you’ve been sleeping for the past twenty years, emoji usage has been going ???. By mid-2015, half of all comments on Instagram included an emoji. Hollywood released a full feature-length film titled The Emoji Movie. Even Google’s CEO Sundar Pichai is posting about urgent fixes to the hamburger emoji.

For some, emoji have caused frustration for users (how the heck are you supposed to use the ? emoji?). Yet for many others, emoji has opened up a fascinating new medium of communication. There are even emoji charade-esque “games” where users can guess a movie title based on a series of emoji. (try these: ?? or ???⚡). But what happens when you push emoji a step further?
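One caveat when the comments are in Chinese: the pattern above includes the range \U000024C2-\U0001F251, which spans the CJK Unified Ideographs block, so Chinese characters are deleted together with the emoji. The sketch below uses a narrower set of ranges instead. The selection of blocks is my own assumption and is not exhaustive; `demoji_narrow` and the sample strings are hypothetical names and data for illustration.

```python
import re
import pandas as pd

# The wide range from the original pattern, shown alone for comparison.
broad_pattern = re.compile("[\U000024C2-\U0001F251]+")

# Narrower ranges (an assumed, non-exhaustive selection of emoji blocks).
narrow_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001FA70-\U0001FAFF"  # symbols & pictographs extended-A
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "\uFE0F"                 # variation selector-16
    "\u200D"                 # zero-width joiner
    "]+"
)

def demoji_narrow(text):
    """Remove characters in the ranges above and keep everything else."""
    return narrow_pattern.sub("", text)

sample = "很好看😀👍 Great⚡!"
broad_result = broad_pattern.sub("", "很好看")  # Chinese text is lost
narrow_result = demoji_narrow(sample)           # Chinese text is kept

data = pd.DataFrame({"header": [sample, "No emoji here"]})
data["header"] = data["header"].astype(str).apply(demoji_narrow)

print(repr(broad_result))     # ''
print(narrow_result)          # 很好看 Great!
print(data["header"].tolist())
```

New emoji are added to Unicode regularly, so a fixed list of ranges like this will miss some; it trades completeness for not touching ordinary text.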

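Hello, thank you so much for your help!
The program really does help, and it is simple and easy enough to modify and adjust. Thank you for sharing so generously.

Great!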

0
shaketweet
iT邦 Trainee ‧ 2024-08-19 16:08:37

Really good question for users to understand all the features. Unique and impressive geometry dash to research the source effectively.
