Preface
Author page: 雪碧有白泡泡
Personal site: 雪碧's personal website
Contents
- Preface
- Regular expressions
- Running the conversion
- Book giveaway
Preface
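In an age of information overload, the text available on the internet is like an endless beach. The genuinely valuable parts, however, are usually buried in individual web pages and have to be filtered and organized before they can be put to use. Fortunately, Python is well suited to this kind of task.
This article shows how to use Python to convert the text of a web page to Markdown, which makes the content easier to read and process. Whether you save an article as a local file or convert it to another format later, Markdown gives you clean, simple formatting so you can focus on the content itself.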
Regular expressions
To convert a page to Markdown accurately, we can rewrite its HTML with regular expressions, as follows:
import re

__all__ = ['Tomd', 'convert']

# Markdown (prefix, suffix) wrapper for each supported tag.
MARKDOWN = {
    'h1': ('\n# ', '\n'),
    'h2': ('\n## ', '\n'),
    'h3': ('\n### ', '\n'),
    'h4': ('\n#### ', '\n'),
    'h5': ('\n##### ', '\n'),
    'h6': ('\n###### ', '\n'),
    'code': ('`', '`'),
    'ul': ('', ''),
    'ol': ('', ''),
    'li': ('- ', ''),
    'blockquote': ('\n> ', '\n'),
    'em': ('**', '**'),
    'strong': ('**', '**'),
    'block_code': ('\n```\n', '\n```\n'),
    'span': ('', ''),
    'p': ('\n', '\n'),
    'p_with_out_class': ('\n', '\n'),
    'inline_p': ('', ''),
    'inline_p_with_out_class': ('', ''),
    'b': ('**', '**'),
    'i': ('*', '*'),
    'del': ('~~', '~~'),
    'hr': ('\n---', '\n\n'),
    'thead': ('\n', '|------\n'),
    'tbody': ('\n', '\n'),
    'td': ('|', ''),
    'th': ('|', ''),
    'tr': ('', '\n')
}

# Patterns for block-level elements. Raw strings avoid invalid-escape warnings for \s.
BlOCK_ELEMENTS = {
    'h1': r'<h1.*?>(.*?)</h1>',
    'h2': r'<h2.*?>(.*?)</h2>',
    'h3': r'<h3.*?>(.*?)</h3>',
    'h4': r'<h4.*?>(.*?)</h4>',
    'h5': r'<h5.*?>(.*?)</h5>',
    'h6': r'<h6.*?>(.*?)</h6>',
    'hr': r'<hr/>',
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'block_code': r'<pre.*?><code.*?>(.*?)</code></pre>',
    'p': r'<p\s.*?>(.*?)</p>',
    'p_with_out_class': r'<p>(.*?)</p>',
    'thead': r'<thead.*?>(.*?)</thead>',
    'tr': r'<tr>(.*?)</tr>'
}

# Patterns for inline elements, applied to the content of each block.
INLINE_ELEMENTS = {
    'td': r'<td>(.*?)</td>',
    'tr': r'<tr>(.*?)</tr>',
    'th': r'<th>(.*?)</th>',
    'b': r'<b>(.*?)</b>',
    'i': r'<i>(.*?)</i>',
    'del': r'<del>(.*?)</del>',
    'inline_p': r'<p\s.*?>(.*?)</p>',
    'inline_p_with_out_class': r'<p>(.*?)</p>',
    'code': r'<code.*?>(.*?)</code>',
    'span': r'<span.*?>(.*?)</span>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'li': r'<li.*?>(.*?)</li>',
    'img': r'<img.*?src="(.*?)".*?>(.*?)</img>',
    'a': r'<a.*?href="(.*?)".*?>(.*?)</a>',
    'em': r'<em.*?>(.*?)</em>',
    'strong': r'<strong.*?>(.*?)</strong>'
}

# Leftover tags that are stripped from the final output.
DELETE_ELEMENTS = [r'<span.*?>', r'</span>', r'<div.*?>', r'</div>']


class Element:
    def __init__(self, start_pos, end_pos, content, tag, is_block=False):
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.content = content
        self._elements = []
        self.is_block = is_block
        self.tag = tag
        self._result = None
        if self.is_block:
            self.parse_inline()

    def __str__(self):
        wrapper = MARKDOWN.get(self.tag)
        self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1])
        return self._result

    def parse_inline(self):
        for tag, pattern in INLINE_ELEMENTS.items():
            if tag == 'a':
                # Group 1 is the href, group 2 is the link text.
                self.content = re.sub(pattern, r'[\g<2>](\g<1>)', self.content)
            elif tag == 'img':
                # Images are dropped from the output.
                self.content = re.sub(pattern, '', self.content)
            elif self.tag == 'ul' and tag == 'li':
                self.content = re.sub(pattern, r'- \g<1>', self.content)
            elif self.tag == 'ol' and tag == 'li':
                self.content = re.sub(pattern, r'1. \g<1>', self.content)
            elif self.tag == 'thead' and tag == 'tr':
                self.content = re.sub(pattern, r'\g<1>\n', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'th':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'td':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            else:
                wrapper = MARKDOWN.get(tag)
                self.content = re.sub(pattern, r'{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content)


class Tomd:
    def __init__(self, html='', options=None):
        self.html = html
        self.options = options
        self._markdown = ''

    def convert(self, html, options=None):
        elements = []
        for tag, pattern in BlOCK_ELEMENTS.items():
            for m in re.finditer(pattern, html, re.I | re.S | re.M):
                element = Element(start_pos=m.start(),
                                  end_pos=m.end(),
                                  content=''.join(m.groups()),
                                  tag=tag,
                                  is_block=True)
                can_append = True
                # Iterate over a copy: removing items from a list while
                # iterating over it would skip elements.
                for e in elements[:]:
                    if e.start_pos < m.start() and e.end_pos > m.end():
                        # This match is nested inside a block already collected.
                        can_append = False
                    elif e.start_pos > m.start() and e.end_pos < m.end():
                        # This match encloses an earlier block; drop the inner one.
                        elements.remove(e)
                if can_append:
                    elements.append(element)
        # Restore document order before joining.
        elements.sort(key=lambda element: element.start_pos)
        self._markdown = ''.join([str(e) for e in elements])
        for index, element in enumerate(DELETE_ELEMENTS):
            self._markdown = re.sub(element, '', self._markdown)
        return self._markdown

    @property
    def markdown(self):
        self.convert(self.html, self.options)
        return self._markdown


_inst = Tomd()
convert = _inst.convert
This code is a utility class that converts HTML to Markdown. It uses regular expressions to find HTML tags and rewrites each one as Markdown according to predefined rules.
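The code defines an Element class that represents a single HTML element. An Element stores the element's start position, end position, content, and tag type. It also provides a parse_inline method, which finds the inline elements inside the content and converts them to Markdown.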
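The inline step described above can be illustrated in isolation. The following is a minimal sketch, not the Tomd implementation: convert_inline and the two trimmed-down tables are hypothetical names introduced here, and only three tags are handled.

```python
import re

# Hypothetical, trimmed-down rule tables for illustration only.
INLINE_PATTERNS = {
    "a": r'<a.*?href="(.*?)".*?>(.*?)</a>',
    "strong": r"<strong.*?>(.*?)</strong>",
    "code": r"<code.*?>(.*?)</code>",
}
WRAPPERS = {"strong": ("**", "**"), "code": ("`", "`")}


def convert_inline(fragment: str) -> str:
    """Rewrite the inline HTML tags in `fragment` as Markdown."""
    for tag, pattern in INLINE_PATTERNS.items():
        if tag == "a":
            # Group 1 is the href, group 2 is the link text.
            fragment = re.sub(pattern, r"[\g<2>](\g<1>)", fragment)
        else:
            prefix, suffix = WRAPPERS[tag]
            # A callable replacement avoids escape handling in templates.
            fragment = re.sub(
                pattern, lambda m: prefix + m.group(1) + suffix, fragment
            )
    return fragment


sample = (
    'See <a href="https://example.com">the <strong>docs</strong></a> '
    "and run <code>pip</code>."
)
print(convert_inline(sample))
```

Because the link rule runs first, the strong tag nested inside the link text is still converted by the later pass.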
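Tomd is the main converter class. It takes an HTML string and provides a convert method that performs the conversion. convert iterates over the predefined block patterns and uses regular expressions to find the matching parts of the HTML string. It creates an Element for each match, skips matches nested inside a block that was already collected, sorts the elements by start position, and finally returns the joined Markdown string.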
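The block-matching step can be sketched on its own. This is an illustrative reduction under stated assumptions, not the code above: find_blocks, BLOCK_PATTERNS, and PREFIX are hypothetical names, and only three block types are covered.

```python
import re

# Hypothetical subset of block patterns, for illustration only.
BLOCK_PATTERNS = {
    "h1": r"<h1.*?>(.*?)</h1>",
    "blockquote": r"<blockquote.*?>(.*?)</blockquote>",
    "p": r"<p>(.*?)</p>",
}
PREFIX = {"h1": "# ", "blockquote": "> ", "p": ""}


def find_blocks(html: str) -> list[tuple[int, int, str, str]]:
    """Return (start, end, tag, content) for outermost blocks in document order."""
    blocks = []
    for tag, pattern in BLOCK_PATTERNS.items():
        for m in re.finditer(pattern, html, re.I | re.S):
            start, end = m.span()
            # Skip a match that sits inside a block we already kept.
            if any(s < start and end < e for s, e, _, _ in blocks):
                continue
            # Drop earlier blocks that this match encloses.
            blocks = [b for b in blocks if not (start < b[0] and b[1] < end)]
            blocks.append((start, end, tag, m.group(1)))
    # Patterns are tried tag by tag, so restore document order at the end.
    return sorted(blocks)


html = "<p>intro</p><h1>Title</h1><blockquote><p>quoted</p></blockquote>"
for _, _, tag, content in find_blocks(html):
    print(PREFIX[tag] + content)
```

The paragraph inside the blockquote is not reported as a separate block; in the full converter its tags would be handled by the inline pass.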
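At the top of the module, the MARKDOWN dictionary maps each HTML tag to its Markdown prefix and suffix. The BlOCK_ELEMENTS and INLINE_ELEMENTS dictionaries hold the regular expression patterns that match block-level and inline elements in the HTML string. The DELETE_ELEMENTS list holds the patterns for leftover HTML tags that are stripped from the output.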
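To make the role of these tables concrete, here is a small sketch with two entries copied from MARKDOWN and the same four DELETE_ELEMENTS patterns. The helper names wrap and strip_leftovers are hypothetical and exist only for this illustration.

```python
import re

# Two entries copied from the MARKDOWN table, plus the delete patterns.
MARKDOWN = {"h2": ("\n## ", "\n"), "blockquote": ("\n> ", "\n")}
DELETE_ELEMENTS = [r"<span.*?>", r"</span>", r"<div.*?>", r"</div>"]


def wrap(tag: str, content: str) -> str:
    """Surround content with the Markdown prefix and suffix for a tag."""
    prefix, suffix = MARKDOWN[tag]
    return f"{prefix}{content}{suffix}"


def strip_leftovers(markdown: str) -> str:
    """Remove span and div tags that survived the conversion."""
    for pattern in DELETE_ELEMENTS:
        markdown = re.sub(pattern, "", markdown)
    return markdown


text = wrap("h2", '<span class="x">Setup</span>') + wrap("blockquote", "<div>Note</div>")
print(strip_leftovers(text))
```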
Now that we have an HTML-to-Markdown converter, we can use it to convert web pages.
Running the conversion
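First, the result_file function builds the path where a result file is saved. It takes the user's folder name, the file name, and the output folder name as parameters. If the target folder does not exist yet, it creates the folder and an empty file inside it, and in either case it returns the file's path.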
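The same idea can be written more compactly with pathlib. This is a sketch of an alternative, not the author's function: result_path and its base_dir parameter are hypothetical, and unlike result_file it does not create an empty file.

```python
import tempfile
from pathlib import Path


def result_path(base_dir: Path, folder_name: str, username: str, file_name: str) -> Path:
    """Create <base_dir>/<folder_name>/<username>/ if needed and return the file path."""
    folder = base_dir / folder_name / username
    # exist_ok avoids an error when several threads create the same folder.
    folder.mkdir(parents=True, exist_ok=True)
    return folder / file_name


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = result_path(Path(tmp), "articles", "some_user", "README.md")
    path.write_text("# some_user\n", encoding="utf-8")
    print(path.relative_to(tmp))
```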
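The get_headers function reads the request headers, including the Cookie, from a text file and returns them as a dictionary. It takes the path of that text file as its parameter; each line of the file has the form name:value.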
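A minimal sketch of this parsing step follows. parse_header_lines is a hypothetical helper that works on a list of lines rather than a file path, and the sample header values are made up. It splits on the first colon only, so values that contain colons (such as URLs) stay intact.

```python
def parse_header_lines(lines):
    """Parse 'Name: value' lines into a dict, ignoring blank or malformed lines."""
    headers = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only; header values may contain more.
        name, value = line.split(":", 1)
        headers[name.strip()] = value.strip()
    return headers


# Made-up sample content, standing in for the lines of the header file.
sample = [
    "User-Agent: Mozilla/5.0\n",
    "Referer: https://blog.csdn.net/\n",
    "\n",
    "Cookie: uuid=abc; token=xyz\n",
]
print(parse_header_lines(sample))
```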
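The delete_ele function removes the specified tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of selectors, uses the object's select method to find the matching tags, and removes each one with the decompose method.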
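The delete_ele_attr function removes the specified attributes from the tags in a BeautifulSoup object. It takes a BeautifulSoup object and a list of attribute names, uses find_all to visit every tag, and removes each listed attribute with Python's del statement.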
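The delete_blank_ele function removes empty tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of exceptions; every tag that is not in the exception list and has no text content is removed with the decompose method.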
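The three cleanup steps can be tried together on a small fragment. This sketch assumes the third-party beautifulsoup4 package is installed; the HTML snippet is made up, and the loops are inlined instead of calling the helper functions.

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped like an article body.
html = (
    '<div id="content_views">'
    '<p class="lead" style="color:red">Hello</p>'
    "<svg><path></path></svg>"
    "<p></p>"
    '<img src="a.png">'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# 1. Remove unwanted tags, selected with CSS selectors.
for selector in ["svg"]:
    for tag in soup.select(selector):
        tag.decompose()

# 2. Remove presentation attributes from every remaining tag.
for tag in soup.find_all():
    for attr in ["class", "style", "id"]:
        tag.attrs.pop(attr, None)

# 3. Remove tags without text, except elements that are legitimately empty.
for tag in soup.find_all():
    if tag.name not in {"img", "br", "hr"} and tag.get_text() == "":
        tag.decompose()

print(soup)
```

The order matters: the svg element is removed before the empty-tag pass, and img is kept because it is in the exception set even though it has no text.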
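The TaskQueue class is a simple task queue that stores visited and unvisited URLs in two lists. It provides methods for reading, inserting into, and removing from these lists.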
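Since several threads share this queue, one possible variation guards the lists with a lock. The sketch below is not the TaskQueue class itself: UrlQueue is a hypothetical name, it hands out URLs in insertion order, and it moves each popped URL to the visited list, which the original leaves to the caller.

```python
import threading


class UrlQueue:
    """A small thread-safe queue of URLs that remembers what was handed out."""

    def __init__(self):
        self._lock = threading.Lock()
        self._unvisited = []
        self._visited = []

    def add(self, url):
        """Queue a URL unless it is already pending or visited."""
        with self._lock:
            if url not in self._unvisited and url not in self._visited:
                self._unvisited.append(url)

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        with self._lock:
            if not self._unvisited:
                return None
            url = self._unvisited.pop(0)
            self._visited.append(url)
            return url

    def __len__(self):
        with self._lock:
            return len(self._unvisited)


queue = UrlQueue()
for url in ["https://example.com/a", "https://example.com/a", "https://example.com/b"]:
    queue.add(url)
print(len(queue), queue.pop())
```

Returning None from pop gives worker threads an explicit stop signal instead of relying on an exception.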
import contextlib
import os
import re
import sys
import threading
import time

import requests
from bs4 import BeautifulSoup, Comment

# Tomd is the converter class defined in the previous section.


def result_file(folder_username, file_name, folder_name):
    folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", folder_name, folder_username)
    if not os.path.exists(folder):
        try:
            os.makedirs(folder)
        except Exception:
            pass
        path = os.path.join(folder, file_name)
        file = open(path, "w")
        file.close()
    else:
        path = os.path.join(folder, file_name)
    return path


def get_headers(cookie_path: str):
    cookies = {}
    with open(cookie_path, "r", encoding="utf-8") as f:
        cookie_list = f.readlines()
    for line in cookie_list:
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only so values containing ":" stay intact.
        cookie = line.split(":", 1)
        cookies[cookie[0]] = str(cookie[1]).strip()
    return cookies


def delete_ele(soup: BeautifulSoup, tags: list):
    for ele in tags:
        for useless_tag in soup.select(ele):
            useless_tag.decompose()


def delete_ele_attr(soup: BeautifulSoup, attrs: list):
    for attr in attrs:
        for useless_attr in soup.find_all():
            del useless_attr[attr]


def delete_blank_ele(soup: BeautifulSoup, eles_except: list):
    for useless_attr in soup.find_all():
        try:
            if useless_attr.name not in eles_except and useless_attr.text == "":
                useless_attr.decompose()
        except Exception:
            pass


class TaskQueue(object):
    def __init__(self):
        self.VisitedList = []
        self.UnVisitedList = []

    def getVisitedList(self):
        return self.VisitedList

    def getUnVisitedList(self):
        return self.UnVisitedList

    def InsertVisitedList(self, url):
        if url not in self.VisitedList:
            self.VisitedList.append(url)

    def InsertUnVisitedList(self, url):
        if url not in self.UnVisitedList:
            self.UnVisitedList.append(url)

    def RemoveVisitedList(self, url):
        self.VisitedList.remove(url)

    def PopUnVisitedList(self, index=0):
        url = []
        if index and self.UnVisitedList:
            url = self.UnVisitedList[index]
            del self.UnVisitedList[:index]
        elif self.UnVisitedList:
            url = self.UnVisitedList.pop()
        return url

    def getUnVisitedListLength(self):
        return len(self.UnVisitedList)


class CSDN(object):
    def __init__(self, username, folder_name, cookie_path):
        # self.headers = {
        #     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        # }
        self.headers = get_headers(cookie_path)
        self.s = requests.Session()
        self.username = username
        self.TaskQueue = TaskQueue()
        self.folder_name = folder_name
        self.url_num = 1

    def start(self):
        # Walk the paginated article list until a page comes back empty.
        num = 0
        articles = [None]
        while len(articles) > 0:
            num += 1
            url = u'https://blog.csdn.net/' + self.username + '/article/list/' + str(num)
            response = self.s.get(url=url, headers=self.headers)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")
            articles = soup.find_all('div', attrs={"class": "article-item-box csdn-tracking-statistics"})
            for article in articles:
                article_title = article.a.text.strip().replace(' ', ':')
                article_href = article.a['href']
                with ensure_memory(sys.getsizeof(self.TaskQueue.UnVisitedList)):
                    self.TaskQueue.InsertUnVisitedList([article_title, article_href])

    def get_md(self, url):
        response = self.s.get(url=url, headers=self.headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        content = soup.select_one("#content_views")
        # Remove comments
        for useless_tag in content(text=lambda text: isinstance(text, Comment)):
            useless_tag.extract()
        # Remove useless tags
        tags = ["svg", "ul", ".hljs-button.signin"]
        delete_ele(content, tags)
        # Remove tag attributes
        attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"]
        delete_ele_attr(content, attrs)
        # Remove empty tags
        eles_except = ["img", "br", "hr"]
        delete_blank_ele(content, eles_except)
        # Convert to Markdown
        md = Tomd(str(content)).markdown
        return md

    def write_readme(self):
        print("+" * 100)
        print("[++] Start crawling posts by {} ......".format(self.username))
        print("+" * 100)
        reademe_path = result_file(self.username, file_name="README.md", folder_name=self.folder_name)
        with open(reademe_path, 'w', encoding='utf-8') as reademe_file:
            readme_head = "# Posts by " + self.username + "\n"
            reademe_file.write(readme_head)
            for [article_title, article_href] in self.TaskQueue.UnVisitedList[::-1]:
                text = str(self.url_num) + '. [' + article_title + '](' + article_href + ')\n'
                reademe_file.write(text)
                self.url_num += 1
        self.url_num = 1

    def get_all_articles(self):
        try:
            while True:
                # An empty queue returns [], so this unpacking raises and the
                # outer except ends the worker.
                [article_title, article_href] = self.TaskQueue.PopUnVisitedList()
                try:
                    file_name = re.sub(r'[\/::*?"<>|]', '-', article_title) + ".md"
                    artical_path = result_file(folder_username=self.username, file_name=file_name, folder_name=self.folder_name)
                    md_head = "# " + article_title + "\n"
                    md = md_head + self.get_md(article_href)
                    print("[++++] Processing URL: {}".format(article_href))
                    with open(artical_path, "w", encoding="utf-8") as artical_file:
                        artical_file.write(md)
                except Exception:
                    print("[----] Error while processing URL: {}".format(article_href))
                self.url_num += 1
        except Exception:
            pass

    def muti_spider(self, thread_num):
        while self.TaskQueue.getUnVisitedListLength() > 0:
            thread_list = []
            for i in range(thread_num):
                th = threading.Thread(target=self.get_all_articles)
                thread_list.append(th)
            for th in thread_list:
                th.start()
            # Wait for this batch; otherwise the loop keeps spawning new threads.
            for th in thread_list:
                th.join()


lock = threading.Lock()
total_mem = 1024 * 1024 * 500  # 500 MB of spare memory


@contextlib.contextmanager
def ensure_memory(size):
    # Reserve `size` bytes from the shared budget, waiting until it is available.
    global total_mem
    while 1:
        with lock:
            if total_mem > size:
                total_mem -= size
                break
        time.sleep(5)
    yield
    with lock:
        total_mem += size


def spider_user(username: str, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    csdn = CSDN(username, folder_name, cookie_path)
    csdn.start()
    th1 = threading.Thread(target=csdn.write_readme)
    th1.start()
    th2 = threading.Thread(target=csdn.muti_spider, args=(thread_num,))
    th2.start()


def spider(usernames: list, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    for username in usernames:
        try:
            user_thread = threading.Thread(target=spider_user, args=(username, cookie_path, thread_num, folder_name))
            user_thread.start()
            print("[++] Started the crawl thread for {} ......".format(username))
        except Exception:
            print("[--] Failed to start the crawl thread for {} ......".format(username))
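The ensure_memory helper in the code above reserves part of a shared budget before a piece of work starts and gives it back afterwards. The sketch below shows the same pattern under different assumptions: the Budget class is hypothetical, it waits on a threading.Condition instead of polling with time.sleep, and it releases the reservation in a finally block.

```python
import contextlib
import threading


class Budget:
    """A shared numeric budget that threads reserve from and return to."""

    def __init__(self, total):
        self._available = total
        self._cond = threading.Condition()

    @contextlib.contextmanager
    def reserve(self, size):
        with self._cond:
            # Block until enough of the budget is free.
            self._cond.wait_for(lambda: self._available >= size)
            self._available -= size
        try:
            yield
        finally:
            # Return the reservation even if the body raised.
            with self._cond:
                self._available += size
                self._cond.notify_all()

    @property
    def available(self):
        with self._cond:
            return self._available


budget = Budget(total=100)
with budget.reserve(60):
    print(budget.available)
print(budget.available)
```

A request larger than the total budget would wait forever in this sketch, so callers need to keep reservations below the total.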
We can write a small test class to run this. It creates a folder at the local path and writes the Markdown files into it.
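Book giveaway
- 🎁 This round gives away 1 to 3 books (depending on the view count: the more views, the more books) 👈
- Event period: ends on 27 December 2023
- How to enter: follow the author and give the post a like, a favorite, and a comment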