Using PHP regular expressions to filter a table out of Excel-exported HTML

php

chan15 2010-05-28 17:22:11 ‧ 9874 views

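A client converts Excel files into HTML pages and wants to reuse the table inside.
Reading the HTML with file_get_contents also pulls in a pile of CSS and JavaScript,
so I first use preg_match to extract the table I want.
The code is as follows: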

$file = 'xxx.html';
$load = iconv('big5', 'utf-8', file_get_contents($file));
preg_match('/<table.*>.*<\/table>/is', $load, $arr);
$table = $arr[0];

But the table still contains some odd leftovers,
such as <col width="76" style="width: 57pt;" class="xl65">.
I want to strip these out but don't know how.

$table = preg_replace('/<col.*>/is', '', $table);

This does not work.
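One likely reason the pattern above misbehaves: `.*` is greedy, and with the `s` modifier it also crosses line breaks, so a single match can run from the first `<col` to the last `>` in the whole string. A minimal sketch of a tighter pattern follows; the sample `$table` string is made up for illustration and is not the client's real data.

```php
<?php
// Sample data standing in for the table extracted earlier (an assumption).
$table = '<table><col width="76" style="width: 57pt;" class="xl65">'
       . '<col width=60><colgroup></colgroup><tr><td>a</td></tr></table>';

// [^>]* stops at the tag's own ">", so each <col ...> is matched separately.
// \b prevents the pattern from also matching <colgroup>.
$clean = preg_replace('/<col\b[^>]*>/i', '', $table);

echo $clean, "\n";
// prints: <table><colgroup></colgroup><tr><td>a</td></tr></table>
```

This assumes no attribute value inside a col tag contains a literal `>`, which appears to hold for the Excel output shown in this thread.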

2 answers

逮丸逮丸

iT邦大師 Level 1 ‧ 2010-05-28 20:19:49

Best answer

Perl has a very handy module for this, HTML-Scrubber.
Here is a practical example.
Take this xls file, converted to xls.html, as the test input:

<table border=0 cellpadding=0 cellspacing=0 width=209 style='border-collapse:
 collapse;table-layout:fixed;width:157pt'>
 <col width=54 style='mso-width-source:userset;mso-width-alt:1728;width:41pt'>
 <col width=60 style='mso-width-source:userset;mso-width-alt:1920;width:45pt'>
 <col width=95 style='mso-width-source:userset;mso-width-alt:3040;width:71pt'>
 <tr class=xl65 height=22 style='height:16.5pt'>
  <td height=22 class=xl66 width=54 style='height:16.5pt;width:41pt'>姓名</td>
  <td class=xl67 width=60 style='width:45pt'>學號</td>
  <td class=xl67 width=95 style='width:71pt'>身分證字號</td>
 </tr>
 <tr height=22 style='height:16.5pt'>
  <td height=22 class=xl68 width=54 style='height:16.5pt;width:41pt'>溫志鴻<span
  style='mso-spacerun:yes'> </span></td>
  <td class=xl69 align=right width=60 style='width:45pt'>880001</td>
  <td class=xl69 width=95 style='width:71pt'>J000000001</td>
 </tr>
 <tr height=22 style='height:16.5pt'>
  <td height=22 class=xl68 width=54 style='height:16.5pt;width:41pt'>劉泓廷<span
  style='mso-spacerun:yes'> </span></td>
  <td class=xl69 align=right width=60 style='width:45pt'>880002</td>
  <td class=xl69 width=95 style='width:71pt'>J000000002</td>
 </tr>
<tr height=0 style='display:none'>
  <td width=54 style='width:41pt'></td>
  <td width=60 style='width:45pt'></td>
  <td width=95 style='width:71pt'></td>
 </tr>
 <![endif]>
</table>

Write a simple Perl script named clean_html.pl:

#!/usr/bin/perl
open(F, $ARGV[0]) or die "Cannot open $ARGV[0]: $!";  # input file from the command line
@file = <F>;
close F;

use HTML::Scrubber;

foreach $line (@file) {
 $html .= $line;
}
my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );  # keep only these tags
my $clean_html = $scrubber->scrub($html);
print "$clean_html";

Running perl clean_html.pl xls.html produces this output:

<table>



 <tr>
  <td>姓名</td>
  <td>學號</td>
  <td>身分證字號</td>
 </tr>
 <tr>
  <td>溫志鴻 </td>
  <td>880001</td>
  <td>J000000001</td>
 </tr>
 <tr>
  <td>劉泓廷 </td>
  <td>880002</td>
  <td>J000000002</td>
 </tr>
<tr>
  <td></td>
  <td></td>
  <td></td>
 </tr>

</table>

If line 10 of the code is changed to
$scrubber->default(0);
then all HTML tags are stripped.
So the allow list on line 10 lets you decide which tags to keep.

See also the usage example "Strip/sanitize HTML with Perl".
HTML::Scrubber is easy to install on Linux.
On Fedora/RedHat/CentOS:
yum install perl-HTML-Scrubber
On Ubuntu:
aptitude install libhtml-scrubber-perl
and it is installed.

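If you have many files to convert, you can first run this simple script to keep only the tags you want,
then process the resulting files with whatever language you are comfortable with.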

7 replies

逮丸逮丸 iT邦大師 Level 1 ‧ 2010-05-28 20:22:46

twtw wrote:
If line 10 of the code is changed to

That should be line 11.

逮丸逮丸 iT邦大師 Level 1 ‧ 2010-05-28 20:28:52

The clean_html.pl file can be simplified to this:

#!/usr/bin/perl
use HTML::Scrubber;
my $scrubber = HTML::Scrubber->new( allow => [ qw[ table tr td ] ] );
my $clean_html = $scrubber->scrub_file($ARGV[0]);
print "$clean_html";

chan15 iT邦新手 Level 2 ‧ 2010-05-28 21:11:49

But I was asking about PHP =3=

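逮丸逮丸 iT邦大師 Level 1 ‧ 2010-05-28 22:11:52

If it has to be pure PHP, try PEAR's HTML_Safe.
In a discussion linked from the supplementary notes of the answer, someone mentions that it can do what Perl's HTML::Scrubber does.

PHP's strength is handling interactive web pages.
For data that is awkward to parse, like this, there is no need to insist on PHP alone;
the Perl solution gets you there more simply by comparison.

You can use PHP's exec to run this external Perl script.
Something like the following should give you the clean table from the example above: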

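exec('perl clean_html.pl ' . escapeshellarg($file), $lines);  // single quotes would not expand $file
$table = implode("\n", $lines);  // exec() itself returns only the last line of output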

That spares you the effort of reinventing the wheel with PHP regular expressions.

逮丸逮丸 iT邦大師 Level 1 ‧ 2010-05-29 08:27:23

If working on Linux is inconvenient for you,
you can install Perl for Windows.
Just download and run this installer,
then run

ppm install HTML::Scrubber

and the module is installed.

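chan15 iT邦新手 Level 2 ‧ 2010-05-31 21:43:06

The table now comes in fine, but the HTML that Excel generates contains a huge number of junk tr elements (more than 2000),
so rebuilding the table with JS on the client runs into performance problems.
I'd like to ask whether the following could be handled on the server with regular expressions:

1. When a td contains a certain keyword, add a certain class to its tr.
2. Only 70-odd tr elements in the table are actually useful, and I don't know why Excel generates more than 2000 empty ones. The bottom useful tr is labeled 總計 (total). Is it possible to remove every tr below the 總計 td?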

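逮丸逮丸 iT邦大師 Level 1 ‧ 2010-05-31 22:43:28

With a few concrete samples that don't involve anything sensitive for your organization,
it would be easier to work out a likely approach.

As for 1., if the markup is as simple as <tr><td>xxx</td></tr>,
a regex handles it easily.
In practice, though, tr, td and xxx may be split across different lines;
in that case you may need to see whether tidy can clean the markup up first,
so that a regex can handle it.
Alternatively, use a tool that can parse HTML the way jQuery does,
such as Hpricot in Ruby
or one of the HTML-parsing modules in Perl;
working at the DOM level is more effective than relying on regex alone.
PHP Simple HTML DOM Parser may meet your need to do this on the server in PHP.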
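As one possible way to handle item 1 at the DOM level, the sketch below uses PHP's bundled DOMDocument class (it needs the dom extension) rather than PHP Simple HTML DOM Parser. The function name markRowsContaining, the keyword and the class name are made up for illustration.

```php
<?php
// Sketch: add a CSS class to every <tr> that has a <td> containing a keyword.
// Function name, keyword and class name are illustrative assumptions.
function markRowsContaining(string $tableHtml, string $keyword, string $class): string
{
    $doc = new DOMDocument();
    // Excel's markup is not valid HTML, so parser warnings are suppressed.
    // The XML prolog is a common trick to make loadHTML() read the input as UTF-8.
    @$doc->loadHTML('<?xml encoding="UTF-8">' . $tableHtml);

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        foreach ($tr->getElementsByTagName('td') as $td) {
            if (strpos($td->textContent, $keyword) !== false) {
                // Append to any class Excel already put on the row.
                $existing = $tr->getAttribute('class');
                $tr->setAttribute('class', trim($existing . ' ' . $class));
                break; // one match per row is enough
            }
        }
    }

    // Serialize only the first table, not the wrapper html/body nodes.
    $table = $doc->getElementsByTagName('table')->item(0);
    return $table === null ? $tableHtml : $doc->saveHTML($table);
}
```

Because the parser works on elements rather than lines, it does not matter whether tr, td and the keyword sit on the same line of the source file.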

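As for 2.,
copy the file and experiment by trimming and testing as you go.
If all the files have the same number of lines
and the same format, with only the key values differing,
it becomes even easier to process them line by line.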
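For item 2, a plain string cut may already be enough. The sketch below is not the line-by-line approach described above; it assumes the label (總計 by default) first appears in the total row and that the string holds a single table. The function name dropRowsAfterTotal is hypothetical.

```php
<?php
// Sketch: drop every row that follows the row containing the "total" label.
// Function name and default label are illustrative assumptions.
function dropRowsAfterTotal(string $table, string $totalLabel = '總計'): string
{
    $labelPos = strpos($table, $totalLabel);
    if ($labelPos === false) {
        return $table; // label not found: leave the input untouched
    }

    $rowEnd   = stripos($table, '</tr>', $labelPos); // end of the total row
    $tableEnd = strripos($table, '</table>');        // closing tag of the table
    if ($rowEnd === false || $tableEnd === false || $tableEnd < $rowEnd) {
        return $table; // unexpected structure: do nothing
    }

    $rowEnd += strlen('</tr>');
    // Keep everything up to the end of the total row, then close the table.
    return substr($table, 0, $rowEnd) . "\n" . substr($table, $tableEnd);
}
```

If the label can also occur in an earlier data cell, the search would need a stricter anchor, for example the label together with its surrounding td tags.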

fillano

iT邦超人 Level 1 ‧ 2010-05-28 17:38:44

Excel? Why not convert it to CSV instead... then you wouldn't have to deal with Microsoft's pile of custom HTML tags.

6 replies

chan15 iT邦新手 Level 2 ‧ 2010-05-28 17:39:32

I'm dealing with them right now…
The client's workflow can't be changed, so I have to find a way around it.

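chan15 iT邦新手 Level 2 ‧ 2010-05-28 18:12:30

A helpful user suggested a workaround:
use strip_tags($str, '<table><tr><td><img>').
That already covers my first-stage needs.
Next I need to use regular expressions to remove every style attribute and find all the img tags in the string.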
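The two follow-up steps could be sketched roughly as below. The sample `$str` is invented for illustration, and the patterns assume style values are quoted, as in the Excel output shown in this thread; unquoted values are not handled.

```php
<?php
// Sample input standing in for the Excel-exported page (an assumption).
$str = '<div><table style="width:157pt"><tr class=xl65 style=\'height:16.5pt\'>'
     . '<td style="width:41pt"><img src="a.png" alt="A"></td>'
     . '<td><span>text</span><img src=\'b.png\'></td></tr></table></div>';

// Step 1: keep only the tags of interest (their attributes are left untouched).
$kept = strip_tags($str, '<table><tr><td><img>');

// Step 2: drop style attributes, whether double- or single-quoted.
// The leading \s also matches a line break before "style".
$noStyle = preg_replace('/\sstyle\s*=\s*("[^"]*"|\'[^\']*\')/i', '', $kept);

// Step 3: collect every <img ...> tag.
preg_match_all('/<img\b[^>]*>/i', $noStyle, $matches);
$imgs = $matches[0];

echo $noStyle, "\n";
print_r($imgs);
```

The style pattern would also hit cell text that happens to look like a style attribute; if that is a concern, a DOM parser is the safer route.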

小成 iT邦高手 Level 10 ‧ 2010-05-28 19:37:40

Regular expressions~ every time I finish writing one I immediately forget it XD~
and the next time I have to look it all up again XD

chan15 iT邦新手 Level 2 ‧ 2010-05-28 21:11:08

Same here XD

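fillano iT邦超人 Level 1 ‧ 2010-05-28 23:24:55

I see...

Then give tidy a try; there are simple examples there. Note that <col /> is a valid HTML tag, so tidy will not remove it for you. You usually need some configuration, passed as an array to the parseString method; the available options are listed at http://tidy.sourceforge.net/docs/quickref.html

tidy is usually not enabled by default, so enable it in php.ini. You can also use the show-body-only option so that it outputs only the content inside the body tag, which simplifies later processing. For example: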

<?php
$html = 'HTML content string';
$config = array(
           'bare'           => true,
           'show-body-only' => true,
           'indent'         => true,
           'output-html'    => true,
           'wrap'           => 200);
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
echo $tidy;
?>

(Slightly adapted from the example.)