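Today I want to build a program that queries IT Ironman Contest articles. Later I will use it to sync the articles to my blog automatically, so I no longer have to copy them by hand.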
Link pattern: https://ithelp.ithome.com.tw/users/{user ID}/ironman/{series ID}, as shown in the screenshot.
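To make the link pattern concrete, here is a minimal sketch that builds a series URL. The `IronmanUrl.Build` helper is my own illustration and is not part of the site or of the final program; the `?page=` query is the one used for paging later in this post.

```csharp
using System;

public static class IronmanUrl
{
    // Hypothetical helper: fills the user ID and series ID into the link pattern.
    public static string Build(string userId, string seriesId, int page = 1)
    {
        var baseUrl = "https://ithelp.ithome.com.tw/users/"
            + Uri.EscapeDataString(userId)
            + "/ironman/"
            + Uri.EscapeDataString(seriesId);

        // Only append the paging query when asking for a page after the first.
        return page > 1 ? baseUrl + "?page=" + page : baseUrl;
    }
}
```

Calling `IronmanUrl.Build(userId, seriesId, 2)` gives the address of the second page of a series, with whatever IDs you pass in.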
To avoid blocking the program, the page is requested asynchronously with async/await and its HTML is returned as a string.
public async Task<string> GetAsync(string uri)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    // Let the request transparently decompress gzip/deflate responses.
    request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
    using (HttpWebResponse response = (HttpWebResponse)await request.GetResponseAsync())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        // Read the whole response body as the page HTML.
        return await reader.ReadToEndAsync();
    }
}
AngleSharp's QuerySelectorAll and QuerySelector filter the DOM with jQuery-style selectors.
private static readonly HtmlParser _parser = new HtmlParser();
var dom = _parser.Parse("<html></html>").QuerySelector("html");
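As a slightly larger illustration of the two selector methods, here is a sketch that pulls every link out of a small list. It assumes the AngleSharp 0.9.x API used in this post (`AngleSharp.Parser.Html.HtmlParser` with `Parse`); newer releases moved the parser to `AngleSharp.Html.Parser` and renamed the method to `ParseDocument`. The `SelectorDemo` class and the `li.item` markup are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x

public static class SelectorDemo
{
    // Returns "text -> href" for every <a> that is a direct child of li.item.
    public static List<string> ExtractLinks(string html)
    {
        var document = new HtmlParser().Parse(html);

        return document.QuerySelectorAll("li.item > a")
            .Select(a => a.TextContent.Trim() + " -> " + a.GetAttribute("href"))
            .ToList();
    }
}
```

QuerySelectorAll returns every match, while QuerySelector returns only the first one (or null when nothing matches), which is why the single-node lookups later in this post use QuerySelector.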
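First, look at what each article contains: publish date, title, content, and link.
Each article's DOM node can be selected with class='profile-list__content', as shown in the screenshot.
Use AngleSharp's QuerySelector to grab the A tag under class=qa-list__title. The text of that node is the article title, and its href attribute is the link.
Then grab the title attribute of the class=qa-list__info-time node under the article node; it holds the publish time.
Code: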
var allpost = document.QuerySelectorAll(".profile-list__content");
foreach (var postInfo in allpost)
{
    var post = new Post();
    var titleAndLinkDom = postInfo.QuerySelector(".qa-list__title>a");
    post.Title = titleAndLinkDom.InnerHtml.Trim(); /* title */
    post.link = titleAndLinkDom.GetAttribute("href").Trim(); /* link */
    /* publish time, stored in the title attribute */
    post.PubDate = DateTime.Parse(postInfo.QuerySelector(".qa-list__info-time").GetAttribute("title").Trim());
}
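DateTime.Parse in the loop above depends on the machine's current culture and throws on unexpected input. A more defensive sketch follows; it assumes the title attribute looks like `yyyy-MM-dd HH:mm:ss`, which I have not confirmed against the page, and `PubDateParser` is a hypothetical helper name.

```csharp
using System;
using System.Globalization;

public static class PubDateParser
{
    // Assumed format of the title attribute; adjust it if the page differs.
    private const string AssumedFormat = "yyyy-MM-dd HH:mm:ss";

    // Returns null instead of throwing when the text is not a date.
    public static DateTime? Parse(string raw)
    {
        if (string.IsNullOrWhiteSpace(raw)) return null;
        var text = raw.Trim();

        // Try the assumed exact format first, independent of the machine culture.
        DateTime exact;
        if (DateTime.TryParseExact(text, AssumedFormat, CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out exact))
            return exact;

        // Fall back to a looser, still culture-invariant parse.
        DateTime loose;
        if (DateTime.TryParse(text, CultureInfo.InvariantCulture, DateTimeStyles.None, out loose))
            return loose;

        return null;
    }
}
```

With this, a malformed timestamp on one article can be skipped or logged instead of aborting the whole sync.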
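The article content is not on the list page, so the HTML has to be loaded from the article link captured above. The InnerHtml of the class=markdown__style node is the article content.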
private async Task<string> GetPostContentAsync(string posturl)
{
    var htmlContent = (await GetAsync(posturl));
    var document = _parser.Parse(htmlContent);
    // Return an empty string instead of throwing when the content node is missing.
    return document.QuerySelectorAll(".markdown__style").FirstOrDefault()?.InnerHtml ?? string.Empty;
}
Complete code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using AngleSharp.Parser.Html; // namespace of HtmlParser in AngleSharp 0.9.x, which provides Parse()

public class ITIronManSyncPostService
{
    private static readonly HtmlParser _parser = new HtmlParser();
    public IList<Post> Posts { get; set; } = new List<Post>();
    private string _url { get; set; }

    // Factory method: creates the service and loads all posts of the series at the given URL.
    public async static Task<ITIronManSyncPostService> GetITIronManPosts(string url)
    {
        var itironman = new ITIronManSyncPostService();
        itironman._url = url;
        await itironman.ExecuteAsync();
        return itironman;
    }

    private async Task ExecuteAsync()
    {
        // The IT Ironman Contest only requires 30 articles and each page lists 10,
        // so fetching pages 1 to 3 is enough.
        for (int i = 1; i < 4; i++)
            await GetITIronManPostsAsync(_url + $"?page={i}");
    }

    private async Task GetITIronManPostsAsync(string url)
    {
        var htmlContent = (await GetAsync(url));
        var document = _parser.Parse(htmlContent);

        // Get the Ironman series title.
        var article = document.QuerySelector(".qa-list__title--ironman");
        article.RemoveChild(article.QuerySelector("span")); /* remove the series label text */
        var articleText = article.TextContent.Trim();

        // Get each post's publish date, title, content, and link.
        var allpost = document.QuerySelectorAll(".profile-list__content");
        foreach (var postInfo in allpost)
        {
            var post = new Post();
            var titleAndLinkDom = postInfo.QuerySelector(".qa-list__title>a");
            post.Title = titleAndLinkDom.InnerHtml.Trim();
            post.link = titleAndLinkDom.GetAttribute("href").Trim();
            // Await instead of blocking on .Result, which can deadlock and defeats the async design.
            post.Content = (await GetPostContentAsync(post.link)).Trim();
            post.PubDate = DateTime.Parse(postInfo.QuerySelector(".qa-list__info-time").GetAttribute("title").Trim());
            post.Article = articleText;
            Posts.Add(post);
        }
    }

    private async Task<string> GetPostContentAsync(string posturl)
    {
        var htmlContent = (await GetAsync(posturl));
        var document = _parser.Parse(htmlContent);
        // Return an empty string instead of throwing when the content node is missing.
        return document.QuerySelectorAll(".markdown__style").FirstOrDefault()?.InnerHtml ?? string.Empty;
    }

    public async Task<string> GetAsync(string uri)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        using (HttpWebResponse response = (HttpWebResponse)await request.GetResponseAsync())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            return await reader.ReadToEndAsync();
        }
    }

    public class Post
    {
        public string Title { get; set; }
        public string link { get; set; }
        public string Content { get; set; }
        public string Article { get; set; }
        public DateTime PubDate { get; set; }
    }
}
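The service downloads each article's content one request at a time. A possible variation, assuming the site tolerates a few parallel requests (I have not checked its rate limits), is to fetch the contents concurrently with a small cap. This is only a sketch: `ConcurrentFetch`, `FetchAllAsync`, and `maxParallel` are names I made up, and the `fetch` delegate stands in for a method such as GetPostContentAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentFetch
{
    // Runs fetch for every URL with at most maxParallel requests in flight.
    // Results come back in the same order as the input URLs.
    public static async Task<IList<string>> FetchAllAsync(
        IEnumerable<string> urls, Func<string, Task<string>> fetch, int maxParallel = 3)
    {
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Select(async url =>
            {
                await gate.WaitAsync();      // wait for a free slot
                try { return await fetch(url); }
                finally { gate.Release(); }  // free the slot even if fetch throws
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}
```

Passing the fetch function in as a delegate keeps the sketch independent of the network, so it can be tried with a fake function before pointing it at real article links.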
Tomorrow I will finish hooking this up to MiniBlog. That's it for today.