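While parsing the body text of a PDF file, I used tabula together with concat from the pandas module,
but I get a KeyError when reading values from the concatenated result.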
# Convert the PDF to text
try:
    df = tabula.read_pdf(input_path=pdffile, output_format='dataframe', lattice=True, multiple_tables=True, pages="all", guess=False,
                         encoding='utf-8')
    df = pd.concat(df)
except Exception as e:
    print(e)
df.to_csv("outing.csv", index=None, header=True)
get_sh_content(df)
# Read the contents
def get_sh_content(df):
    ans = []
    print(len(df.index))
    df.fillna(0)
    print(df[df.columns[0]][len(df.index)])
This produces the following error:
File "test.py", line 44, in get_sh_content
print(df[df.columns[0]][len(df.index)])
File "C:\Users\user\Anaconda3\lib\site-packages\pandas\core\series.py", line 1068, in __getitem__
result = self.index.get_value(self, key)
File "C:\Users\user\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 4730, in get_value
return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 88, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 126, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 152, in pandas._libs.index.IndexEngine._get_loc_duplicates
File "pandas\_libs\index_class_helper.pxi", line 122, in pandas._libs.index.Int64Engine._maybe_get_bool_indexer
KeyError: 5237
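Is this caused by NA values in the data, or by something else?
I exported the data to a CSV file to inspect it and found many empty values. I tried replacing them with fillna, but that did not solve the problem.
Any pointers from someone who knows this would be much appreciated!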
The error I get is: TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
This version runs successfully for me:
df = tabula.read_pdf(input_path=pdffile, output_format='dataframe', pages="all", guess=False, encoding='utf-8')
res = df.values.tolist()
If you want to convert the DataFrame to a list, df.values.tolist() is all you need.
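As a small illustration of what df.values.tolist() returns, here is a sketch using a made-up two-column DataFrame (the column names and values are invented, not taken from the PDF):

```python
import pandas as pd

# Made-up data standing in for a table extracted from a PDF.
df = pd.DataFrame({"id": [1, 2], "name": ["x", "y"]})

# Each row becomes one inner list; column labels and the index are dropped.
rows = df.values.tolist()
print(rows)  # [[1, 'x'], [2, 'y']]
```

Because the column labels are lost, keep df.columns around if you need them later.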
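Your problem is probably not limited to the above, though. I also can't tell what the reading part at the bottom of your code is meant to do.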
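On the KeyError itself, one likely cause (a guess, since I can't see the PDF) is the index rather than the NA values. The sketch below assumes read_pdf returned a list of DataFrames and uses two made-up tables in its place. pd.concat keeps each table's own row labels by default, and len(df.index) is one past the last position, so that label usually does not exist. Note also that fillna returns a new DataFrame instead of modifying df in place.

```python
import pandas as pd

# Two made-up tables standing in for the per-page tables (assumption).
tables = [
    pd.DataFrame({"name": ["a", "b", None]}),
    pd.DataFrame({"name": ["c", "d"]}),
]

# Plain concat keeps each table's own 0..n-1 labels, so labels repeat.
df = pd.concat(tables)
print(list(df.index))  # [0, 1, 2, 0, 1]

# len(df.index) is 5, but no row is labelled 5, so the lookup fails.
try:
    df[df.columns[0]][len(df.index)]
except KeyError as e:
    print("KeyError:", e)

# ignore_index=True renumbers the rows 0..n-1.
df = pd.concat(tables, ignore_index=True)

# fillna returns a new frame, so the result has to be assigned back.
df = df.fillna(0)

# Positional access to the last row avoids the off-by-one label.
last_value = df[df.columns[0]].iloc[-1]
print(last_value)  # d
```

If this matches your case, the 5237 in your traceback is simply the row count being used as a label.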