

news 2025/7/7 14:14:57


Preface

Author homepage: 雪碧有白泡泡
Personal website: 雪碧's personal website

Contents

Preface
Preface
Regular expressions
Performing the conversion
Book giveaway

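Preface

In an age of information overload, the sheer volume of text on the internet is like an endless beach. The genuinely valuable information, however, is often buried in assorted web pages and has to be filtered and organized before it can be put to good use. Fortunately, Python is well suited to this task.

This article shows how to use Python to convert web page text to Markdown, which makes web content easier to read and process. Whether the goal is to save an article as a local file or to convert it to another format, Markdown offers clean, simple formatting that keeps the focus on the content itself.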

Regular expressions

To keep the conversion of a page to Markdown accurate, we can rewrite the HTML with regular expressions, as follows:

```python
import re

__all__ = ['Tomd', 'convert']

# Markdown prefix and suffix for each supported tag.
MARKDOWN = {
    'h1': ('\n# ', '\n'),
    'h2': ('\n## ', '\n'),
    'h3': ('\n### ', '\n'),
    'h4': ('\n#### ', '\n'),
    'h5': ('\n##### ', '\n'),
    'h6': ('\n###### ', '\n'),
    'code': ('`', '`'),
    'ul': ('', ''),
    'ol': ('', ''),
    'li': ('- ', ''),
    'blockquote': ('\n> ', '\n'),
    'em': ('**', '**'),
    'strong': ('**', '**'),
    'block_code': ('\n```\n', '\n```\n'),
    'span': ('', ''),
    'p': ('\n', '\n'),
    'p_with_out_class': ('\n', '\n'),
    'inline_p': ('', ''),
    'inline_p_with_out_class': ('', ''),
    'b': ('**', '**'),
    'i': ('*', '*'),
    'del': ('~~', '~~'),
    'hr': ('\n---', '\n\n'),
    'thead': ('\n', '|------\n'),
    'tbody': ('\n', '\n'),
    'td': ('|', ''),
    'th': ('|', ''),
    'tr': ('', '\n')
}

# Patterns for block-level elements (raw strings keep \s intact).
BlOCK_ELEMENTS = {
    'h1': r'<h1.*?>(.*?)</h1>',
    'h2': r'<h2.*?>(.*?)</h2>',
    'h3': r'<h3.*?>(.*?)</h3>',
    'h4': r'<h4.*?>(.*?)</h4>',
    'h5': r'<h5.*?>(.*?)</h5>',
    'h6': r'<h6.*?>(.*?)</h6>',
    'hr': r'<hr/>',
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'block_code': r'<pre.*?><code.*?>(.*?)</code></pre>',
    'p': r'<p\s.*?>(.*?)</p>',
    'p_with_out_class': r'<p>(.*?)</p>',
    'thead': r'<thead.*?>(.*?)</thead>',
    'tr': r'<tr>(.*?)</tr>'
}

# Patterns for inline elements, applied inside each block.
INLINE_ELEMENTS = {
    'td': r'<td>(.*?)</td>',
    'tr': r'<tr>(.*?)</tr>',
    'th': r'<th>(.*?)</th>',
    'b': r'<b>(.*?)</b>',
    'i': r'<i>(.*?)</i>',
    'del': r'<del>(.*?)</del>',
    'inline_p': r'<p\s.*?>(.*?)</p>',
    'inline_p_with_out_class': r'<p>(.*?)</p>',
    'code': r'<code.*?>(.*?)</code>',
    'span': r'<span.*?>(.*?)</span>',
    'ul': r'<ul.*?>(.*?)</ul>',
    'ol': r'<ol.*?>(.*?)</ol>',
    'li': r'<li.*?>(.*?)</li>',
    'img': r'<img.*?src="(.*?)".*?>(.*?)</img>',
    'a': r'<a.*?href="(.*?)".*?>(.*?)</a>',
    'em': r'<em.*?>(.*?)</em>',
    'strong': r'<strong.*?>(.*?)</strong>'
}

# Leftover tags that are stripped from the final Markdown.
DELETE_ELEMENTS = ['<span.*?>', '</span>', '<div.*?>', '</div>']


class Element:
    def __init__(self, start_pos, end_pos, content, tag, is_block=False):
        self.start_pos = start_pos
        self.end_pos = end_pos
        self.content = content
        self._elements = []
        self.is_block = is_block
        self.tag = tag
        self._result = None

        if self.is_block:
            self.parse_inline()

    def __str__(self):
        wrapper = MARKDOWN.get(self.tag)
        self._result = '{}{}{}'.format(wrapper[0], self.content, wrapper[1])
        return self._result

    def parse_inline(self):
        # Rewrite every inline tag inside this block, one pattern at a time.
        for tag, pattern in INLINE_ELEMENTS.items():
            if tag == 'a':
                self.content = re.sub(pattern, r'[\g<2>](\g<1>)', self.content)
            elif tag == 'img':
                self.content = re.sub(pattern, r'![\g<2>](\g<1>)', self.content)
            elif self.tag == 'ul' and tag == 'li':
                self.content = re.sub(pattern, r'- \g<1>', self.content)
            elif self.tag == 'ol' and tag == 'li':
                self.content = re.sub(pattern, r'1. \g<1>', self.content)
            elif self.tag == 'thead' and tag == 'tr':
                self.content = re.sub(pattern, '\\g<1>\n', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'th':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            elif self.tag == 'tr' and tag == 'td':
                self.content = re.sub(pattern, r'|\g<1>', self.content.replace('\n', ''))
            else:
                wrapper = MARKDOWN.get(tag)
                self.content = re.sub(pattern, r'{}\g<1>{}'.format(wrapper[0], wrapper[1]), self.content)


class Tomd:
    def __init__(self, html='', options=None):
        self.html = html
        self.options = options
        self._markdown = ''

    def convert(self, html, options=None):
        elements = []
        for tag, pattern in BlOCK_ELEMENTS.items():
            for m in re.finditer(pattern, html, re.I | re.S | re.M):
                element = Element(start_pos=m.start(),
                                  end_pos=m.end(),
                                  content=''.join(m.groups()),
                                  tag=tag,
                                  is_block=True)
                can_append = True
                # Iterate over a copy, because nested elements are removed in the loop.
                for e in elements[:]:
                    if e.start_pos < m.start() and e.end_pos > m.end():
                        # The new match lies inside an existing element: skip it.
                        can_append = False
                    elif e.start_pos > m.start() and e.end_pos < m.end():
                        # An existing element lies inside the new match: drop it.
                        elements.remove(e)
                if can_append:
                    elements.append(element)

        elements.sort(key=lambda element: element.start_pos)
        self._markdown = ''.join([str(e) for e in elements])

        for pattern in DELETE_ELEMENTS:
            self._markdown = re.sub(pattern, '', self._markdown)
        return self._markdown

    @property
    def markdown(self):
        self.convert(self.html, self.options)
        return self._markdown


_inst = Tomd()
convert = _inst.convert
```

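This code is a utility class that converts HTML to Markdown. It uses regular expressions to parse HTML tags and converts each one to the corresponding Markdown according to predefined rules.

The code defines an Element class that represents a single HTML element. It stores the element's start position, end position, content, and tag type. It also provides a parse_inline method, which parses the inline elements inside the element and converts them to Markdown.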
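The inline step can be sketched in isolation. The block below is a minimal illustration, not the full parse_inline method: the three-rule INLINE_RULES table and the inline_to_markdown helper are hypothetical names introduced here, and the sample HTML is made up. It shows how `re.sub` with `\g<n>` backreferences turns each matched tag into its Markdown form.

```python
import re

# Hypothetical, trimmed-down rule table (not the full INLINE_ELEMENTS dict).
# Each entry pairs a tag pattern with a replacement template.
INLINE_RULES = [
    (r'<a.*?href="(.*?)".*?>(.*?)</a>', r'[\g<2>](\g<1>)'),
    (r'<strong.*?>(.*?)</strong>', r'**\g<1>**'),
    (r'<code.*?>(.*?)</code>', r'`\g<1>`'),
]


def inline_to_markdown(fragment: str) -> str:
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in INLINE_RULES:
        fragment = re.sub(pattern, replacement, fragment)
    return fragment


html = ('See <a href="https://example.com">the <strong>docs</strong></a> '
        'for <code>re.sub</code>.')
result = inline_to_markdown(html)
print(result)
```

Because the rules run one after another over the same string, a tag nested inside a link (the `<strong>` above) is still converted by a later rule.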

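Tomd is the main conversion class. It accepts an HTML string and provides a convert method that performs the conversion. convert iterates over the predefined block-level tag patterns and uses regular expressions to find the matching parts of the HTML string. It creates an Element object for each match, discards matches that are nested inside another match, and finally returns the joined Markdown string.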
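The way nested matches are filtered out can be shown on its own. This is a simplified sketch under stated assumptions: BLOCK_RULES is a hypothetical two-rule table, find_outermost_blocks is a name introduced for illustration, and the filtering is done in a single pass rather than incrementally as convert does. The containment test (strictly earlier start and strictly later end) is the same comparison convert uses.

```python
import re

# Hypothetical two-rule table used only for this illustration.
BLOCK_RULES = {
    'blockquote': r'<blockquote.*?>(.*?)</blockquote>',
    'p': r'<p>(.*?)</p>',
}


def find_outermost_blocks(html: str):
    """Return (tag, start, end) spans, dropping any span nested inside another."""
    spans = []
    for tag, pattern in BLOCK_RULES.items():
        for m in re.finditer(pattern, html, re.I | re.S):
            spans.append((tag, m.start(), m.end()))
    # Keep a span only if no other span strictly contains it.
    outermost = [
        s for s in spans
        if not any(o[1] < s[1] and o[2] > s[2] for o in spans)
    ]
    # Sort by start position so the output follows document order.
    return sorted(outermost, key=lambda s: s[1])


html = '<p>intro</p><blockquote><p>quoted</p></blockquote>'
blocks = find_outermost_blocks(html)
print(blocks)
```

The `<p>` inside the blockquote is matched by the paragraph rule but dropped, so its text is emitted once, as part of the blockquote, rather than twice.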

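At the top of the module, the MARKDOWN dictionary maps each HTML tag to its Markdown prefix and suffix. The BlOCK_ELEMENTS and INLINE_ELEMENTS dictionaries define the regular expression patterns that match block-level and inline elements in the HTML string. The DELETE_ELEMENTS list defines the leftover HTML tags that are stripped from the result.

Now that we have a tool for converting to Markdown, we can use it to convert web pages.

Performing the conversion

First, the result_file function builds the path of the file in which a result is saved. It takes the user name (used as a subfolder name), the file name, and the folder name as parameters. If the target folder does not exist yet, it creates the folder and an empty file inside it. In either case it returns the file's path.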
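The same idea can be written with pathlib. This is a minimal sketch, not the article's function: result_path and its base_dir parameter are hypothetical, and unlike result_file it takes the base directory explicitly instead of deriving it from `__file__`. A temporary directory is used so that the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def result_path(base_dir: Path, folder_name: str, username: str, file_name: str) -> Path:
    """Create base_dir/folder_name/username if needed and return the file path."""
    folder = base_dir / folder_name / username
    # parents=True creates intermediate folders; exist_ok=True tolerates reruns.
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / file_name
    # Create an empty file so that the path exists before it is written.
    path.touch(exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = result_path(Path(tmp), "articles", "some_user", "README.md")
    created = path.exists()
    relative = path.relative_to(tmp).as_posix()

print(created, relative)
```

Passing `exist_ok=True` avoids the race in which two threads both see a missing folder and both try to create it.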

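The get_headers function reads the Cookie, along with any other request headers, from a text file and stores the entries in a dictionary. It takes the path of that text file as its parameter and expects one name: value pair per line.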
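The parsing step is easy to get subtly wrong, because header values such as URLs contain colons themselves. The sketch below assumes the file holds one `Name: value` pair per line; parse_header_lines is a hypothetical helper and the sample values are invented. It splits only on the first colon and skips lines that contain none.

```python
def parse_header_lines(lines):
    """Turn 'Name: value' lines into a dict, splitting on the first colon only."""
    headers = {}
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep:
            # Blank or malformed line: nothing to store.
            continue
        headers[name.strip()] = value.strip()
    return headers


# Invented sample content standing in for the lines of the cookie file.
sample = [
    "User-Agent: Mozilla/5.0\n",
    "Referer: https://blog.csdn.net/\n",
    "\n",
    "Cookie: uuid=abc123; token=xyz\n",
]
headers = parse_header_lines(sample)
print(headers)
```

Using `str.partition` keeps everything after the first colon together, so the `https://` in the Referer value survives intact.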

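The delete_ele function removes the specified tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of selectors for the tags to remove, selects the matching tags with the object's select method, and removes each one with decompose.

The delete_ele_attr function removes the specified attributes from the tags in a BeautifulSoup object. It takes a BeautifulSoup object and a list of attribute names, selects every tag with find_all, and removes each named attribute with Python's del statement.

The delete_blank_ele function removes blank tags from a BeautifulSoup object. It takes a BeautifulSoup object and a list of exceptions, and calls decompose on every tag that is not in the exception list and whose text is empty.

The TaskQueue class is a simple task queue that stores visited and unvisited URLs in two lists. It provides a set of methods for working with those lists.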
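Because several worker threads pop from the same list, a lock-protected variant may be worth considering. The following is a sketch under assumptions, not the article's class: SafeTaskQueue and its method names are hypothetical, and it hands out tasks in first-in, first-out order, whereas PopUnVisitedList takes the last item by default.

```python
import queue
import threading


class SafeTaskQueue:
    """Hypothetical thread-safe variant: a FIFO queue plus a set of seen URLs."""

    def __init__(self):
        self._pending = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def add(self, title, url):
        """Queue (title, url) unless the URL was already added; return True if queued."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
        self._pending.put((title, url))
        return True

    def pop(self):
        """Return the next task, or None when the queue is empty."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None

    def __len__(self):
        return self._pending.qsize()


tasks = SafeTaskQueue()
tasks.add("Post A", "https://example.com/a")
tasks.add("Post B", "https://example.com/b")
duplicate_added = tasks.add("Post A again", "https://example.com/a")
first = tasks.pop()
print(duplicate_added, first, len(tasks))
```

Returning None from pop gives workers an explicit signal to stop, instead of relying on an exception raised while unpacking an empty result.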

```python
import contextlib
import os
import re
import sys
import threading
import time

import requests
from bs4 import BeautifulSoup, Comment

# Tomd is the converter class from the previous section; import it from
# wherever that module is saved, or keep both parts in one file.


def result_file(folder_username, file_name, folder_name):
    """Return the output path, creating the folder and an empty file if the folder is new."""
    folder = os.path.join(os.path.dirname(os.path.realpath(__file__)), "..", folder_name, folder_username)
    path = os.path.join(folder, file_name)
    if not os.path.exists(folder):
        # exist_ok guards against another thread creating the folder first.
        os.makedirs(folder, exist_ok=True)
        with open(path, "w"):
            pass
    return path


def get_headers(cookie_path: str):
    """Read 'Name: value' lines from a text file into a dict of request headers."""
    cookies = {}
    with open(cookie_path, "r", encoding="utf-8") as f:
        cookie_list = f.readlines()
    for line in cookie_list:
        # Split on the first colon only, so values containing ':' stay intact.
        cookie = line.split(":", 1)
        if len(cookie) < 2:
            continue
        cookies[cookie[0]] = str(cookie[1]).strip()
    return cookies


def delete_ele(soup: BeautifulSoup, tags: list):
    for ele in tags:
        for useless_tag in soup.select(ele):
            useless_tag.decompose()


def delete_ele_attr(soup: BeautifulSoup, attrs: list):
    for attr in attrs:
        for useless_attr in soup.find_all():
            del useless_attr[attr]


def delete_blank_ele(soup: BeautifulSoup, eles_except: list):
    for useless_attr in soup.find_all():
        try:
            if useless_attr.name not in eles_except and useless_attr.text == "":
                useless_attr.decompose()
        except Exception:
            pass


class TaskQueue(object):
    def __init__(self):
        self.VisitedList = []
        self.UnVisitedList = []

    def getVisitedList(self):
        return self.VisitedList

    def getUnVisitedList(self):
        return self.UnVisitedList

    def InsertVisitedList(self, url):
        if url not in self.VisitedList:
            self.VisitedList.append(url)

    def InsertUnVisitedList(self, url):
        if url not in self.UnVisitedList:
            self.UnVisitedList.append(url)

    def RemoveVisitedList(self, url):
        self.VisitedList.remove(url)

    def PopUnVisitedList(self, index=0):
        # Returns [] when nothing is left; callers rely on this to stop.
        url = []
        if index and self.UnVisitedList:
            url = self.UnVisitedList[index]
            del self.UnVisitedList[:index]
        elif self.UnVisitedList:
            url = self.UnVisitedList.pop()
        return url

    def getUnVisitedListLength(self):
        return len(self.UnVisitedList)


class CSDN(object):
    def __init__(self, username, folder_name, cookie_path):
        # self.headers = {
        #     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36"
        # }
        self.headers = get_headers(cookie_path)
        self.s = requests.Session()
        self.username = username
        self.TaskQueue = TaskQueue()
        self.folder_name = folder_name
        self.url_num = 1

    def start(self):
        """Walk the paginated article list and queue every [title, href] pair."""
        num = 0
        articles = [None]
        while len(articles) > 0:
            num += 1
            url = u'https://blog.csdn.net/' + self.username + '/article/list/' + str(num)
            response = self.s.get(url=url, headers=self.headers)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")
            articles = soup.find_all('div', attrs={"class": "article-item-box csdn-tracking-statistics"})
            for article in articles:
                article_title = article.a.text.strip().replace('        ', '：')
                article_href = article.a['href']
                with ensure_memory(sys.getsizeof(self.TaskQueue.UnVisitedList)):
                    self.TaskQueue.InsertUnVisitedList([article_title, article_href])

    def get_md(self, url):
        """Download one article and return its body as Markdown."""
        response = self.s.get(url=url, headers=self.headers)
        html = response.text
        soup = BeautifulSoup(html, 'lxml')
        content = soup.select_one("#content_views")
        # Remove comments
        for useless_tag in content(text=lambda text: isinstance(text, Comment)):
            useless_tag.extract()
        # Remove useless tags
        tags = ["svg", "ul", ".hljs-button.signin"]
        delete_ele(content, tags)
        # Remove tag attributes
        attrs = ["class", "name", "id", "onclick", "style", "data-token", "rel"]
        delete_ele_attr(content, attrs)
        # Remove blank tags
        eles_except = ["img", "br", "hr"]
        delete_blank_ele(content, eles_except)
        # Convert to Markdown
        md = Tomd(str(content)).markdown
        return md

    def write_readme(self):
        print("+" * 100)
        print("[++] Start crawling the blog posts of {} ......".format(self.username))
        print("+" * 100)
        reademe_path = result_file(self.username, file_name="README.md", folder_name=self.folder_name)
        with open(reademe_path, 'w', encoding='utf-8') as reademe_file:
            readme_head = "# Blog posts of " + self.username + "\n"
            reademe_file.write(readme_head)
            for [article_title, article_href] in self.TaskQueue.UnVisitedList[::-1]:
                text = str(self.url_num) + '. [' + article_title + '](' + article_href + ')\n'
                reademe_file.write(text)
                self.url_num += 1
        self.url_num = 1

    def get_all_articles(self):
        try:
            while True:
                # Unpacking the empty list from an exhausted queue raises
                # ValueError, which ends the loop through the outer except.
                [article_title, article_href] = self.TaskQueue.PopUnVisitedList()
                try:
                    file_name = re.sub(r'[\/:：*?"<>|]', '-', article_title) + ".md"
                    artical_path = result_file(folder_username=self.username, file_name=file_name, folder_name=self.folder_name)
                    md_head = "# " + article_title + "\n"
                    md = md_head + self.get_md(article_href)
                    print("[++++] Processing URL: {}".format(article_href))
                    with open(artical_path, "w", encoding="utf-8") as artical_file:
                        artical_file.write(md)
                except Exception:
                    print("[----] Error while processing URL: {}".format(article_href))
                self.url_num += 1
        except Exception:
            pass

    def muti_spider(self, thread_num):
        while self.TaskQueue.getUnVisitedListLength() > 0:
            thread_list = []
            for i in range(thread_num):
                th = threading.Thread(target=self.get_all_articles)
                thread_list.append(th)
            for th in thread_list:
                th.start()
            # Wait for this batch, so the loop does not keep spawning threads.
            for th in thread_list:
                th.join()


lock = threading.Lock()
total_mem = 1024 * 1024 * 500  # 500 MB spare memory


@contextlib.contextmanager
def ensure_memory(size):
    """Reserve size bytes from the shared budget for the duration of the block."""
    global total_mem
    while 1:
        with lock:
            if total_mem > size:
                total_mem -= size
                break
        time.sleep(5)
    try:
        yield
    finally:
        # Return the reservation even if the block raised.
        with lock:
            total_mem += size


def spider_user(username: str, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    csdn = CSDN(username, folder_name, cookie_path)
    csdn.start()
    th1 = threading.Thread(target=csdn.write_readme)
    th1.start()
    th2 = threading.Thread(target=csdn.muti_spider, args=(thread_num,))
    th2.start()


def spider(usernames: list, cookie_path: str, thread_num: int = 10, folder_name: str = "articles"):
    for username in usernames:
        try:
            user_thread = threading.Thread(target=spider_user, args=(username, cookie_path, thread_num, folder_name))
            user_thread.start()
            print("[++] Started the crawl thread for {} ......".format(username))
        except Exception:
            print("[--] Failed to start the crawl thread for {} ......".format(username))
```
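The ensure_memory helper reserves part of a shared budget and polls every five seconds when the budget is exhausted. One possible alternative, sketched here under assumptions, waits on a `threading.Condition` instead of sleeping. The Budget class and its reserve method are hypothetical names introduced for illustration and are not part of the article's code.

```python
import contextlib
import threading


class Budget:
    """Hypothetical helper: a shared byte budget that callers reserve from."""

    def __init__(self, total):
        self._available = total
        self._cond = threading.Condition()

    @contextlib.contextmanager
    def reserve(self, size):
        with self._cond:
            # Block until enough budget is free; woken by notify_all below.
            self._cond.wait_for(lambda: self._available >= size)
            self._available -= size
        try:
            yield
        finally:
            # Always hand the reservation back, even if the body raised.
            with self._cond:
                self._available += size
                self._cond.notify_all()

    @property
    def available(self):
        with self._cond:
            return self._available


budget = Budget(100)
with budget.reserve(60):
    inside = budget.available
after = budget.available
print(inside, after)
```

As with the original helper, a request larger than the whole budget would wait forever, so callers should keep reservations well below the total.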