A Script for Downloading and Organizing Chinese Philosophy E-Texts|Jerkwin

2017-11-21 10:40:11

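I like to keep the text of some books on my phone so that I can read a few lines whenever I have a spare moment. Classical Chinese suits this well, because most passages are short and concise. The Chinese Text Project (中国哲学书电子化计划) website hosts many well-curated classical texts, but saving them is inconvenient, so I wondered whether they could be downloaded and organized automatically. After looking at the site's URLs and page format, it seemed feasible. The script is as follows: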

ctxt.bsh

# url="http://ctext.org/analects/zhs"; trs=1
# url="http://ctext.org/mengzi/zhs";   trs=0
# url="http://ctext.org/xunzi/zhs";  trs=0
# url="http://ctext.org/kongzi-jiayu/zhs"
# url="http://ctext.org/shi-shuo-xin-yu/zhs"
# url="http://ctext.org/yan-shi-jia-xun/zhs"
# url="http://ctext.org/zhuangzi/inner-chapters/zhs"
# url="http://ctext.org/book-of-changes/zhs"
url="http://ctext.org/shang-shu/zhs"; trs=0

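# Enable Chinese (UTF-8) support. The -u/-U options belong to Cygwin's locale;
# on other systems set LANG directly, e.g. LANG=zh_CN.UTF-8.
export LANG=$(locale -uU)

# Fetch the table-of-contents page of the chosen book
curl "$url" > _chp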

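# Stage 1: pull each chapter link out of the contents page and append the page to _ctx
awk ' BEGIN{system("rm -rf _ctx")}   # start from an empty _ctx
/　　<a class="menuitem"/ {
	sub(/.*href=\"/, "")   # drop everything up to the opening quote of href
	sub(/\".*/, "")        # drop the closing quote and the rest of the line
	url="http://ctext.org/"$0
	print url
	system("curl "url" >>_ctx")
}
' _chp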

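# Stage 2: number the passages and strip the HTML tags.
# trs=1 treats the entries of a chapter as pairs: odd entries are labelled chapter.pair,
# even entries get the running total. trs=0 labels every entry chapter.section/total.
awk -v trs="$trs" ' BEGIN {chp=0; tot=0}
/<div id="content3"/ {
	# keep only the chapter title between 《 and 》
	gsub(/^[^《]+《/,"《")
	gsub(/》.+/,"》")
	chp++; sec=0
	print "第 "chp" 章　"$0"\n"
}
/<div id="comm[0-9]+"/ {
	gsub(/<[^>]+>/,"")   # remove all HTML tags
	sec++;
	if(trs) {
		if(sec%2) print chp"."(sec+1)/2, $0
		else {tot++; print tot"　"$0"\n"}
	} else {
		tot++;
		print chp"."sec"/"tot, $0"\n"
	}
}
' _ctx > _ctx.txt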
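
To see what the numbering looks like without touching the network, here is a minimal sketch that runs the same second-stage logic on made-up input. The sample lines are an assumption and far simpler than the site's real markup; `number_passages` and `sample` are hypothetical names used only for this illustration, and the chapter-title handling is left out.

```shell
# Hypothetical, heavily simplified stand-in for the downloaded pages in _ctx
sample='<div id="content3">chapter A</div>
<div id="comm1">a1</div>
<div id="comm2">a2</div>
<div id="content3">chapter B</div>
<div id="comm3">b1</div>'

# number_passages TRS: read page lines on stdin, print numbered passages
number_passages() {
	awk -v trs="$1" '
	BEGIN { chp = 0; tot = 0 }
	/<div id="content3"/ { chp++; sec = 0 }      # new chapter: restart section count
	/<div id="comm[0-9]+"/ {
		gsub(/<[^>]+>/, "")                      # strip HTML tags
		sec++
		if (trs) {
			# paired entries: odd ones get chapter.pair, even ones the running total
			if (sec % 2) print chp "." (sec + 1) / 2, $0
			else { tot++; print tot "　" $0 "\n" }
		} else {
			tot++
			print chp "." sec "/" tot, $0 "\n"
		}
	}'
}

printf '%s\n' "$sample" | number_passages 0
```

With `trs=0` each passage is labelled chapter.section/total, so the three sample passages come out as `1.1/1 a1`, `1.2/2 a2` and `2.1/3 b1`, each followed by a blank line. Calling `number_passages 1` instead pairs the entries of each chapter, which is the mode the script uses for the Analects URL.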

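Unfortunately, the site states explicitly: "Please note: using automatic download software to download large numbers of pages from this site is strictly prohibited; violators will be blocked automatically without further notice." My tests confirm it: once the download volume gets large, the IP is blocked automatically. The only ways around this are to switch IPs automatically or to download slowly.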
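
The "download slowly" option could look roughly like the sketch below, which pauses between requests instead of fetching every chapter back to back. This is only an illustration under assumptions: `fetch_slowly`, `DELAY` and `FETCH` are hypothetical names that are not part of the script above, the 10-second default is a guess, and I have not verified what delay (if any) keeps the site from blocking the IP. The site's notice applies regardless of the download speed.

```shell
# Assumed knobs, not part of the original script
DELAY=${DELAY:-10}           # seconds to wait after each request (a guess)
FETCH=${FETCH:-"curl -s"}    # command used to fetch a single URL

# fetch_slowly OUTFILE: read one URL per line from stdin and
# append each fetched page to OUTFILE, sleeping between requests
fetch_slowly() {
	out=$1
	: > "$out"                                   # start from an empty file
	while IFS= read -r u; do
		# $FETCH is left unquoted on purpose so "curl -s" splits into words
		$FETCH "$u" >> "$out" || echo "failed: $u" >&2
		sleep "$DELAY"
	done
}

# Usage sketch: feed it the URLs printed by the first awk stage, e.g.
#   awk '...print url...' _chp | fetch_slowly _ctx
```

For this to replace the first stage, the `system("curl ...")` call there would have to be dropped so that the awk program only prints the URLs.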