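Web Scraping and Automated Data Retrieval

In one sentence: use Python to fetch papers, gene records, bioinformatics resources, and similar data from web pages and APIs automatically, instead of copying and pasting by hand.

Why bioinformaticians should learn scraping

Scenario	By hand	With a scraper
Download 100 PubMed abstracts	Search, copy, and paste one by one; about 2 hours	A script finishes in about 30 seconds
Fetch NCBI gene annotations in bulk	Look them up one at a time; easy to miss some	Loop over requests and save the results to a table automatically
Track updates to bioinformatics tools on GitHub	Refresh the page manually every day	Scheduled script plus email notification
Collect sample metadata from a database	Page through and take screenshots by hand	Automatic pagination and structured storage

In plain terms: a scraper makes the computer imitate what you do when you open a browser, find the data, and copy it back. Teach it once and it can repeat the job ten thousand times.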

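Core concepts in plain language

1. HTTP requests: the conversation between browser and server

You (browser) → "I want to see this page" → server
Server → "Here is the page content" → you (browser)

GET request: only "reads" data (for example, opening a web page)
POST request: "submits" data (for example, logging in or searching)
Status codes: 200 = success, 404 = page not found, 403 = access forbidden, 500 = internal server error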
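
As a small illustration of the status codes above, the sketch below maps a numeric code to its standard phrase and a rough category using only the standard library. The helper name `describe_status` is made up for this example; it is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase for known codes
    except ValueError:
        phrase = "Unknown"                # code not defined in HTTPStatus

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"


for code in (200, 403, 404, 500):
    print(describe_status(code))
```

With requests, the same number is available as `response.status_code`.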

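2. HTML structure: the skeleton of a web page

HTML is like a tree in which every tag is a node:

<html>
  <body>
    <div class="article">
      <h1>Paper title</h1>        <!-- h1 tag = top-level heading -->
      <p class="abstract">Abstract text</p>  <!-- p tag = paragraph -->
      <a href="https://...">Link</a>    <!-- a tag = hyperlink -->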
    </div>
  </body>
</html>

In plain terms: HTML wraps page content in tags. A scraper's job is to find the tag you want and pull out the text inside it.
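
To make the "tree" idea concrete, here is a minimal sketch that walks a small HTML snippet with the standard-library `html.parser` and prints each tag indented by its depth. The class name `TreePrinter` and the sample snippet are invented for illustration, and the depth tracking assumes every tag is explicitly closed (void tags such as `<br>` would throw it off).

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect tags and text, indented according to nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f' {k}="{v}"' for k, v in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1  # children of this tag are one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # leaving the tag: back up one level

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text nodes
            self.lines.append("  " * self.depth + repr(text))


SAMPLE_HTML = (
    '<html><body><div class="article">'
    '<h1>Paper title</h1><p class="abstract">Abstract text</p>'
    "</div></body></html>"
)

printer = TreePrinter()
printer.feed(SAMPLE_HTML)
printer.close()
print("\n".join(printer.lines))
```

In real scraping code a parser such as BeautifulSoup builds this tree for you; the sketch only shows what "nodes" and "nesting" mean.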

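3. CSS selectors: pinpointing elements

Selector	Meaning	Matches
`div`	Tag name	All div elements
`.abstract`	Class name	Elements with class="abstract"
`#title`	id	The element with id="title"
`div.article p`	Descendant	All p elements inside a div with class article
`a[href]`	Attribute	a tags that have an href attribute

4. XPath: another way to locate elements

XPath locates elements with path expressions, much like file paths:

//div[@class="abstract"]/p      # p elements directly under a div with class="abstract"
//a/@href                        # href attribute values of all a tags
//h1/text()                      # text content of h1 tags

In plain terms: CSS selectors and XPath are both "addresses" that tell the program where in the page to fetch data. CSS selectors are more concise; XPath is more flexible (for example, it can select by text content or step up to a parent). Beginners can get by with CSS selectors.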
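
The path expressions above can be tried without installing anything, as a rough sketch: the standard-library `xml.etree.ElementTree` understands a limited subset of XPath. Two assumptions to keep in mind: the sample below is well-formed XML (real-world HTML usually is not, which is why lxml or BeautifulSoup is used in practice), and ElementTree does not support `/@href` or `/text()`, so attributes and text are read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (invented for this example)
SAMPLE = """
<html>
  <body>
    <div class="article">
      <h1>Paper title</h1>
      <p class="abstract">Abstract text</p>
      <a href="https://example.org/paper">Link</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# Roughly //div[@class="article"]/p[@class="abstract"]
abstracts = [
    p.text for p in root.findall(".//div[@class='article']/p[@class='abstract']")
]

# Roughly //a/@href: select a elements that have href, then read the attribute
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# Roughly //h1/text()
title = root.findtext(".//h1")

print(abstracts, hrefs, title)
```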

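5. Anti-scraping measures: how sites defend themselves

Websites do not want large volumes of automated requests. Common defenses:

Rate limiting: too many requests in a short time gets the IP throttled or blocked
CAPTCHA: an image challenge that is hard for a machine to pass
User-Agent checks: the server inspects whether the request looks like it came from a browser
Login wall: content is visible only after logging in
Dynamic rendering: data is generated by JavaScript, so a plain HTTP request does not return it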

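6. robots.txt: the site's "notice board"

Most sites publish a robots.txt file in their root directory that declares which paths crawlers may and may not fetch:

# Illustrative example; see https://pubmed.ncbi.nlm.nih.gov/robots.txt for the site's actual rules
User-agent: *
Disallow: /account/       # do not crawl account pages
Allow: /                  # everything else is allowed
Crawl-delay: 1            # wait 1 second between requests (non-standard directive)

In plain terms: robots.txt is the site's house rules. Read it before you crawl, and follow it.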
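
Checking these rules can be automated with the standard-library `urllib.robotparser`. The sketch below parses the illustrative rules from above as a string so it runs offline; the user-agent name `my-research-bot` and the `example.org` URLs are placeholders. Against a real site you would call `set_url(...)` and `read()` instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Same illustrative rules as in the text (not a real site's robots.txt)
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Allow: /
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse rules from memory, no network access


def is_allowed(url: str, user_agent: str = "my-research-bot") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.org/12345/"))
print(is_allowed("https://example.org/account/settings"))
print(parser.crawl_delay("my-research-bot"))  # delay in seconds, or None if unspecified
```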

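7. API vs. scraping

Aspect	API	Scraping
Data format	Structured (usually JSON or XML)	Must be parsed out of HTML
Stability	High; documented	Low; breaks when the page layout changes
Speed	Fast	Slower (HTML must be downloaded and parsed)
Compliance	Officially provided and sanctioned	Gray area; proceed with care
Limits	Rate limits and API keys	IP may be blocked

Rule of thumb: use an API whenever one exists; consider scraping only for what the API cannot provide.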

requests tutorial

Installation

pip install requests          # install the requests library
pip install beautifulsoup4    # install BeautifulSoup (used later)
pip install lxml              # install a faster HTML/XML parser

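GET requests: fetching a page

import requests  # import the requests library

# === The simplest GET request ===
url = "https://httpbin.org/get"  # test URL (echoes back your request details)
response = requests.get(url, timeout=10)  # send the GET request; requests has no default timeout

print(response.status_code)  # status code (200 means success)
print(response.text)         # response body as text
print(response.json())       # if the body is JSON, parse it into a dict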

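POST requests: submitting data

import requests  # import the requests library

# === POST request (simulating a form submission) ===
url = "https://httpbin.org/post"  # test URL
data = {                           # data to submit (as a dict)
    "query": "metagenomics",       # search keyword
    "page": 1                      # page number
}
response = requests.post(url, data=data, timeout=10)  # data= sends a form-encoded body
print(response.json())                                 # print the echoed result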

Setting request headers

import requests  # import the requests library

# === Custom headers (present a browser-like client) ===
headers = {
    # User-Agent tells the server what client you are; the default is python-requests/<version>
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    # Accept tells the server which response formats you can handle
    "Accept": "text/html,application/json",
    # Accept-Language tells the server which languages you prefer
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8"
}

url = "https://httpbin.org/headers"      # test URL
response = requests.get(url, headers=headers, timeout=10)  # send the request with headers
print(response.json())                                      # headers as the server received them

Session: keeping state across requests

import requests  # import the requests library

# === Session (keeps cookies automatically) ===
# In plain terms: a Session is like a logged-in browser; later requests carry the login state
session = requests.Session()  # create a session object

# Headers set here are sent with every request made through this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
})

# First request (for example, a login page): cookies are stored automatically
response1 = session.get("https://httpbin.org/cookies/set/token/abc123", timeout=10)
# The second request sends the stored cookies automatically
response2 = session.get("https://httpbin.org/cookies", timeout=10)
print(response2.json())  # shows that the cookie was sent

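Proxies and timeouts

import requests  # import the requests library

# === Proxy settings (e.g. when direct access is blocked) ===
proxies = {
    "http": "http://PROXY_HOST:PORT",    # proxy used for HTTP URLs
    "https": "http://PROXY_HOST:PORT"    # proxy used for HTTPS URLs
}

# Note: these are placeholders; replace them with a real proxy before use
# response = requests.get("https://example.com", proxies=proxies, timeout=10)

# === Timeouts (keep the program from hanging) ===
try:
    response = requests.get(
        "https://httpbin.org/delay/2",  # this URL responds after a 2-second delay
        timeout=5                        # wait at most 5 seconds, then raise
    )
    print(response.status_code)  # print the status code
except requests.exceptions.Timeout:  # the request timed out
    print("Request timed out; retry later")
except requests.exceptions.ConnectionError:  # the connection failed
    print("Connection failed; check the network or the URL")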

BeautifulSoup tutorial

Parsing HTML

from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package

# === Sample HTML (mimics a paper list page) ===
html_doc = """
<html>
<body>
  <div class="results">
    <div class="article" id="art1">
      <h2 class="title">Gut microbiome in T2D patients</h2>
      <span class="authors">Zhang et al.</span>
      <p class="abstract">We analyzed gut microbiome composition...</p>
      <a href="https://pubmed.ncbi.nlm.nih.gov/12345678/">Full text</a>
    </div>
    <div class="article" id="art2">
      <h2 class="title">Metagenomic analysis of diabetes</h2>
      <span class="authors">Li et al.</span>
      <p class="abstract">Shotgun metagenomic sequencing revealed...</p>
      <a href="https://pubmed.ncbi.nlm.nih.gov/87654321/">Full text</a>
    </div>
  </div>
</body>
</html>
"""

# Create the BeautifulSoup object; "lxml" is the parser (faster than the built-in html.parser)
soup = BeautifulSoup(html_doc, "lxml")

Finding elements

# === find(): the first matching element ===
first_title = soup.find("h2", class_="title")  # first h2 with class="title"
print(first_title.text)  # prints: Gut microbiome in T2D patients

# === find_all(): every matching element (returns a list) ===
all_titles = soup.find_all("h2", class_="title")  # all h2 elements with class="title"
for title in all_titles:        # loop over the titles
    print(title.text)           # print the title text

# === select(): look up elements with a CSS selector ===
abstracts = soup.select("div.article p.abstract")  # p.abstract inside div.article
for ab in abstracts:            # loop over the abstracts
    print(ab.text)              # print the abstract text

# === Reading attribute values ===
links = soup.find_all("a")     # all a tags
for link in links:
    href = link.get("href")    # value of the href attribute (the link target)
    text = link.text           # link text
    print(f"{text} -> {href}")  # prints: text -> link

Extracting structured data

from bs4 import BeautifulSoup  # import BeautifulSoup

# Assumes soup was created as in the code above
articles = []  # list holding one record per paper

# Find every div with class="article" (each div is one paper)
for div in soup.find_all("div", class_="article"):
    article = {  # one dict per paper
        "id": div.get("id"),                           # paper ID (from the id attribute)
        "title": div.find("h2").text.strip(),          # title (text of the h2 tag)
        "authors": div.find("span", class_="authors").text.strip(),  # authors
        "abstract": div.find("p", class_="abstract").text.strip(),   # abstract
        "link": div.find("a").get("href")              # link (href of the a tag)
    }
    articles.append(article)  # add to the list

# Print the results
for art in articles:
    print(f"Title: {art['title']}")
    print(f"Authors: {art['authors']}")
    print(f"Abstract: {art['abstract'][:50]}...")  # first 50 characters only
    print(f"Link: {art['link']}")
    print("---")

Case studies

Case 1: Fetching PubMed abstracts

import requests                     # HTTP library
from bs4 import BeautifulSoup       # HTML/XML parsing library
import time                         # used to pause between requests
import csv                          # CSV file handling

def search_pubmed(query, max_results=10):
    """
    搜索 PubMed 并返回论文摘要列表

    参数:
        query: 搜索关键词（如 "metagenomics T2D"）
        max_results: 最多返回多少条（默认 10）
    返回:
        论文信息列表（字典列表）
    """
    # --- 第一步：通过 NCBI E-utilities API 搜索，获取 PMID 列表 ---
    # E-utilities 是 NCBI 官方提供的 API，比直接爬网页更稳定合规
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    search_params = {
        "db": "pubmed",          # 搜索数据库：PubMed
        "term": query,           # 搜索关键词
        "retmax": max_results,   # 最多返回条数
        "retmode": "json",       # 返回 JSON 格式
        "sort": "relevance"      # 按相关性排序
    }

    print(f"正在搜索 PubMed: {query}")
    response = requests.get(search_url, params=search_params, timeout=15)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses

    data = response.json()  # 解析 JSON 响应
    pmid_list = data["esearchresult"]["idlist"]  # 提取 PMID 列表
    print(f"找到 {len(pmid_list)} 篇论文")

    if not pmid_list:  # 如果没找到论文
        return []

    # --- 第二步：用 PMID 批量获取论文详细信息 ---
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    fetch_params = {
        "db": "pubmed",                  # 数据库
        "id": ",".join(pmid_list),       # 用逗号拼接所有 PMID
        "rettype": "abstract",           # 返回摘要
        "retmode": "xml"                 # 返回 XML 格式（方便解析）
    }

    time.sleep(0.5)  # 间隔 0.5 秒，遵守 NCBI 频率限制（每秒最多 3 次）
    response = requests.get(fetch_url, params=fetch_params, timeout=30)
    response.raise_for_status()

    # --- 第三步：解析 XML 结果 ---
    soup = BeautifulSoup(response.text, "lxml-xml")  # 用 lxml-xml 解析 XML
    articles = []  # 存放结果的列表

    for article in soup.find_all("PubmedArticle"):  # 遍历每篇论文
        # 提取 PMID
        pmid = article.find("PMID").text if article.find("PMID") else "N/A"

        # 提取标题
        title_tag = article.find("ArticleTitle")
        title = title_tag.text.strip() if title_tag else "无标题"

        # 提取摘要（可能有多个段落，拼接起来）
        abstract_parts = article.find_all("AbstractText")
        abstract = " ".join(part.text for part in abstract_parts) if abstract_parts else "无摘要"

        # 提取作者列表
        authors = []
        for author in article.find_all("Author"):
            last = author.find("LastName")    # 姓
            first = author.find("ForeName")   # 名
            if last and first:
                authors.append(f"{last.text} {first.text}")
        author_str = ", ".join(authors) if authors else "未知作者"

        # Journal title; falls back to the same placeholder whether <Journal> or <Title> is missing
        journal_tag = article.find("Journal")
        journal_title = journal_tag.find("Title") if journal_tag else None
        journal = journal_title.text if journal_title else "未知期刊"

        # 提取发表年份
        year_tag = article.find("PubDate")
        year = year_tag.find("Year").text if year_tag and year_tag.find("Year") else "未知年份"

        articles.append({
            "pmid": pmid,
            "title": title,
            "authors": author_str,
            "journal": journal,
            "year": year,
            "abstract": abstract
        })

        print(f"  已获取: PMID {pmid} - {title[:50]}...")

    return articles

# === 使用示例 ===
if __name__ == "__main__":
    results = search_pubmed("gut metagenomics type 2 diabetes", max_results=5)

    # 保存到 CSV 文件
    if results:
        with open("pubmed_results.csv", "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()       # 写表头
            writer.writerows(results)  # 写数据行
        print(f"\n结果已保存到 pubmed_results.csv，共 {len(results)} 篇")

Case 2: Fetching NCBI gene information

import requests  # HTTP 请求库
import time      # 延时库
import json      # JSON 处理库

def get_gene_info(gene_symbol, organism="human"):
    """
    通过 NCBI E-utilities API 获取基因信息

    参数:
        gene_symbol: 基因符号（如 "BRCA1", "TP53"）
        organism: 物种（默认 human）
    返回:
        基因信息字典
    """
    # --- 第一步：搜索基因，获取 Gene ID ---
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    search_params = {
        "db": "gene",                                    # 搜索 Gene 数据库
        "term": f"{gene_symbol}[Gene Name] AND {organism}[Organism]",  # 搜索条件
        "retmode": "json"                                # 返回 JSON
    }

    print(f"正在查询基因: {gene_symbol} ({organism})")
    resp = requests.get(search_url, params=search_params, timeout=15)
    resp.raise_for_status()  # 检查请求是否成功

    ids = resp.json()["esearchresult"]["idlist"]  # 提取 Gene ID 列表
    if not ids:  # 如果没找到
        print(f"未找到基因: {gene_symbol}")
        return None

    gene_id = ids[0]  # 取第一个（最相关的）
    print(f"找到 Gene ID: {gene_id}")

    # --- 第二步：获取基因详细摘要 ---
    time.sleep(0.4)  # 间隔 0.4 秒
    summary_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    summary_params = {
        "db": "gene",         # 数据库
        "id": gene_id,        # Gene ID
        "retmode": "json"     # 返回 JSON
    }

    resp = requests.get(summary_url, params=summary_params, timeout=15)
    resp.raise_for_status()

    result = resp.json()["result"]    # 解析结果
    gene_data = result.get(gene_id, {})  # 取对应 ID 的数据

    # Extract the key fields
    info = {
        "gene_id": gene_id,
        "symbol": gene_data.get("name", gene_symbol),           # gene symbol
        "full_name": gene_data.get("description", "N/A"),       # full gene name
        "organism": gene_data.get("organism", {}).get("scientificname", organism),  # species
        "chromosome": gene_data.get("chromosome", "N/A"),       # chromosome
        "map_location": gene_data.get("maplocation", "N/A"),    # cytogenetic location
        # Assumption: the Gene ESummary record exposes "geneticsource" (e.g. "genomic")
        # rather than a "genetictype" field; verify against a live response
        "genetic_source": gene_data.get("geneticsource", "N/A"),
        "summary": gene_data.get("summary", "无摘要")            # functional summary
    }

    return info

# === 批量查询示例 ===
if __name__ == "__main__":
    # 生信常关注的基因列表
    genes = ["BRCA1", "TP53", "EGFR", "INS", "TNF"]
    all_results = []  # 存放所有结果

    for gene in genes:
        info = get_gene_info(gene)  # 查询每个基因
        if info:
            all_results.append(info)
            print(f"  {info['symbol']}: {info['full_name']}")
            print(f"  染色体: {info['chromosome']}, 位置: {info['map_location']}")
            print()
        time.sleep(0.5)  # 每次查询间隔 0.5 秒

    # 保存为 JSON 文件
    with open("gene_info.json", "w", encoding="utf-8") as f:
        json.dump(all_results, f, ensure_ascii=False, indent=2)  # 中文不转义，缩进 2 格
    print(f"结果已保存到 gene_info.json，共 {len(all_results)} 个基因")

Case 3: Scraping bioinformatics tool documentation

import requests                  # HTTP 请求库
from bs4 import BeautifulSoup    # HTML 解析库

def scrape_tool_docs(url):
    """
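    Scrape an online documentation page for a bioinformatics tool and extract headings and body text.
    Uses a Biopython documentation page as the example.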

    参数:
        url: 文档页面的 URL
    返回:
        文档内容字典
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36"
    }

    print(f"正在获取文档: {url}")
    response = requests.get(url, headers=headers, timeout=20)
    response.raise_for_status()
    response.encoding = response.apparent_encoding  # 自动检测编码，防止中文乱码

    soup = BeautifulSoup(response.text, "lxml")  # 解析 HTML

    # 提取页面标题
    title = soup.find("title")
    page_title = title.text.strip() if title else "无标题"

    # 提取所有标题（h1-h4）和对应段落
    sections = []  # 存放各节内容
    for heading in soup.find_all(["h1", "h2", "h3", "h4"]):
        section = {
            "level": heading.name,             # 标题级别（h1/h2/h3/h4）
            "title": heading.text.strip(),     # 标题文本
            "content": ""                      # 对应内容
        }

        # 收集标题后面的所有段落，直到遇到下一个标题
        content_parts = []
        for sibling in heading.find_next_siblings():  # 遍历后面的兄弟元素
            if sibling.name in ["h1", "h2", "h3", "h4"]:  # 遇到下一个标题就停
                break
            if sibling.name in ["p", "li", "pre", "code"]:  # 只取段落/列表/代码
                content_parts.append(sibling.text.strip())

        section["content"] = "\n".join(content_parts)  # 拼接内容
        sections.append(section)

    # 提取所有代码块
    code_blocks = []
    for code in soup.find_all("pre"):  # pre 标签通常包裹代码块
        code_blocks.append(code.text.strip())

    return {
        "title": page_title,
        "url": url,
        "sections": sections,
        "code_blocks": code_blocks
    }

# === 使用示例 ===
if __name__ == "__main__":
    # 爬取 Biopython 教程页面（选一个实际存在的文档页面）
    doc = scrape_tool_docs("https://biopython.org/wiki/SeqIO")

    print(f"\n文档标题: {doc['title']}")
    print(f"共 {len(doc['sections'])} 个章节")
    print(f"共 {len(doc['code_blocks'])} 个代码块")

    # 打印前 3 个章节
    for sec in doc["sections"][:3]:
        print(f"\n[{sec['level']}] {sec['title']}")
        if sec["content"]:
            print(f"  {sec['content'][:100]}...")  # 只显示前 100 字

Case 4: Fetching GitHub project information

import requests  # HTTP 请求库
import time      # 延时库
import json      # JSON 处理库

def get_github_repos(query, sort="stars", max_results=10):
    """
    通过 GitHub API 搜索生信相关项目

    参数:
        query: 搜索关键词（如 "metagenomics pipeline"）
        sort: 排序方式（stars/forks/updated）
        max_results: 最多返回条数
    返回:
        项目信息列表
    """
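    # GitHub provides an official REST API, so there is no need to scrape pages
    url = "https://api.github.com/search/repositories"
    headers = {
        "Accept": "application/vnd.github.v3+json",  # request the v3 API media type
        # A GitHub token raises the rate limits (core REST endpoints: 5,000/hour vs 60/hour;
        # the Search API used here has its own, lower per-minute limit)
        # "Authorization": "token YOUR_GITHUB_TOKEN"
    }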
    params = {
        "q": query,              # 搜索关键词
        "sort": sort,            # 排序方式
        "order": "desc",         # 降序
        "per_page": max_results  # 每页结果数
    }

    print(f"正在搜索 GitHub: {query}")
    response = requests.get(url, headers=headers, params=params, timeout=15)
    response.raise_for_status()

    data = response.json()  # 解析 JSON
    repos = []  # 存放结果

    for item in data.get("items", []):  # iterate over the search results
        repo = {
            "name": item["full_name"],                     # full repo name (owner/repo)
            "description": item.get("description") or "",  # the API returns null when unset
            "url": item["html_url"],                       # GitHub page URL
            "stars": item["stargazers_count"],             # star count
            "forks": item["forks_count"],                  # fork count
            "language": item.get("language") or "N/A",     # null when no language is detected
            "updated": item["updated_at"][:10],            # last update date (YYYY-MM-DD)
            "license": (item.get("license") or {}).get("spdx_id", "N/A"),  # license may be null
        }
        repos.append(repo)
        print(f"  {repo['name']} - Stars: {repo['stars']}, Language: {repo['language']}")

    return repos

# === 使用示例 ===
if __name__ == "__main__":
    # 搜索宏基因组相关工具
    results = get_github_repos("metagenomics pipeline bioinformatics", max_results=10)

    # 保存结果
    with open("github_bioinfo_tools.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print(f"\n共找到 {len(results)} 个项目，已保存到 github_bioinfo_tools.json")

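API tutorial

How APIs differ from scraping

Scraping pries data out of web pages that were built for human readers; an API is a data interface the site offers on purpose.

Scraping flow: request page → receive HTML → parse HTML → extract data
API flow:      request endpoint → receive JSON → use it directly

REST API basics

The core idea of a REST API: URLs identify resources, and HTTP methods act on them.

HTTP method	Purpose	Example
GET	Retrieve data	Fetch paper information
POST	Create data	Submit an analysis job
PUT	Update data	Change a configuration
DELETE	Delete data	Delete a job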
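
The four methods can be illustrated without contacting any server. The sketch below only builds request objects with the standard-library `urllib.request.Request` and prints their method and URL; `api.example.com` and the paths are hypothetical. With requests, the corresponding calls would be `requests.get`, `requests.post`, `requests.put`, and `requests.delete`.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical API, nothing is sent to it


def build_request(method, path, payload=None):
    """Build (but do not send) an HTTP request for a JSON-based REST API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"  # only needed when a body is sent
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


requests_to_send = [
    build_request("GET", "/v1/papers/12345"),              # retrieve a resource
    build_request("POST", "/v1/jobs", {"tool": "blast"}),  # create a resource
    build_request("PUT", "/v1/config", {"threads": 8}),    # update a resource
    build_request("DELETE", "/v1/jobs/42"),                # delete a resource
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```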

A general template for calling APIs with requests

import requests  # 导入 requests 库
import json      # JSON 处理

def call_api(base_url, endpoint, params=None, headers=None, method="GET"):
    """
    通用 API 调用函数

    参数:
        base_url: API 基础地址（如 "https://api.example.com"）
        endpoint: 接口路径（如 "/v1/search"）
        params: 查询参数（字典）
        headers: 请求头（字典）
        method: 请求方法（GET/POST）
    返回:
        JSON 响应数据
    """
    url = f"{base_url}{endpoint}"  # 拼接完整 URL

    # 默认请求头
    default_headers = {
        "Accept": "application/json",     # 期望返回 JSON
        "Content-Type": "application/json" # 发送的数据也是 JSON
    }
    if headers:  # 如果有自定义 headers，合并进去
        default_headers.update(headers)

    try:
        if method.upper() == "GET":
            resp = requests.get(url, params=params, headers=default_headers, timeout=15)
        elif method.upper() == "POST":
            resp = requests.post(url, json=params, headers=default_headers, timeout=15)
        else:
            raise ValueError(f"不支持的方法: {method}")

        resp.raise_for_status()  # 检查状态码
        return resp.json()       # 返回解析后的 JSON

    except requests.exceptions.HTTPError as e:
        print(f"HTTP 错误: {e.response.status_code} - {e.response.text[:200]}")
    except requests.exceptions.Timeout:
        print("请求超时")
    except requests.exceptions.ConnectionError:
        print("连接失败")

    return None  # 出错时返回 None

# === 调用 NCBI E-utilities API 示例 ===
if __name__ == "__main__":
    result = call_api(
        base_url="https://eutils.ncbi.nlm.nih.gov",
        endpoint="/entrez/eutils/esearch.fcgi",
        params={
            "db": "pubmed",
            "term": "CRISPR metagenomics",
            "retmode": "json",
            "retmax": 5
        }
    )

    if result:
        ids = result["esearchresult"]["idlist"]
        print(f"找到 PMID: {ids}")

Data storage

Saving as CSV

import csv  # Python 内置 CSV 库

data = [
    {"pmid": "12345", "title": "Paper A", "year": "2024"},
    {"pmid": "67890", "title": "Paper B", "year": "2023"},
]

# --- 写入 CSV ---
# encoding="utf-8-sig" 解决 Excel 打开中文乱码问题（加 BOM 头）
with open("output.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["pmid", "title", "year"])  # 指定列名
    writer.writeheader()       # 写入表头
    writer.writerows(data)     # 写入所有数据行

# --- 读取 CSV ---
with open("output.csv", "r", encoding="utf-8-sig") as f:
    reader = csv.DictReader(f)  # 按字典方式读
    for row in reader:          # 遍历每行
        print(row["title"])     # 按列名取值

Saving as JSON

import json  # Python 内置 JSON 库

data = [
    {"gene": "BRCA1", "function": "DNA repair"},
    {"gene": "TP53", "function": "Tumor suppressor"},
]

# --- 写入 JSON ---
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(
        data,                   # 要写入的数据
        f,                      # 文件对象
        ensure_ascii=False,     # 中文不转义（不然会变成 \u4e2d 这种）
        indent=2                # 缩进 2 格（方便阅读）
    )

# --- 读取 JSON ---
with open("output.json", "r", encoding="utf-8") as f:
    loaded = json.load(f)       # 读取并解析
    for item in loaded:
        print(f"{item['gene']}: {item['function']}")

Saving to a SQLite database

import sqlite3  # Python 内置 SQLite 库（不用额外安装）

# --- 创建数据库和表 ---
conn = sqlite3.connect("biodata.db")  # 连接数据库（文件不存在会自动创建）
cursor = conn.cursor()                 # 创建游标

# 创建表（如果不存在）
cursor.execute("""
    CREATE TABLE IF NOT EXISTS papers (
        pmid TEXT PRIMARY KEY,          -- 主键，PMID
        title TEXT NOT NULL,            -- 标题，不能为空
        authors TEXT,                   -- 作者
        abstract TEXT,                  -- 摘要
        year INTEGER                    -- 年份
    )
""")

# --- 插入数据 ---
papers = [
    ("12345", "Gut microbiome study", "Zhang et al.", "We analyzed...", 2024),
    ("67890", "Metagenomics analysis", "Li et al.", "Shotgun...", 2023),
]

# INSERT OR IGNORE：如果 PMID 已存在就跳过，不会报错
cursor.executemany(
    "INSERT OR IGNORE INTO papers VALUES (?, ?, ?, ?, ?)",
    papers
)
conn.commit()  # 提交事务（写入磁盘）

# --- 查询数据 ---
cursor.execute("SELECT pmid, title, year FROM papers WHERE year >= 2023")
for row in cursor.fetchall():  # 获取所有结果
    print(f"PMID: {row[0]}, 标题: {row[1]}, 年份: {row[2]}")

conn.close()  # 关闭数据库连接

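Coping with anti-scraping measures

1. Request delays

import time    # time utilities
import random  # random numbers

# Fixed delay
time.sleep(1)  # wait 1 second after each request

# Random delay (looks more natural and avoids a fixed request rhythm)
delay = random.uniform(0.5, 2.0)  # random value between 0.5 and 2 seconds
time.sleep(delay)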
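
Sprinkling `time.sleep` calls through a script is easy to forget, so one option is to wrap the delay in a small helper that enforces a minimum gap between consecutive requests. The `Throttle` class below is a sketch written for this tutorial, not a library API; the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the interval; return the seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited


throttle = Throttle(min_interval=0.2)
for i in range(3):
    throttle.wait()  # call this right before each request
    print(f"request {i} would be sent here")
```

For NCBI, an interval of about 0.34 seconds keeps a script under three requests per second.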

2. Random User-Agent

import random  # 随机数库

# User-Agent 池（模拟不同浏览器）
UA_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36",
]

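def get_random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": random.choice(UA_LIST),  # pick one UA at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        # "br" (Brotli) is left out: requests can decode it only when a Brotli package is installed
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive"
    }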

3. Proxy pool (simple version)

import random    # 随机数库
import requests  # HTTP 请求库

# 代理池（实际使用时需要替换成真实可用的代理）
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def request_with_proxy(url, max_retries=3):
    """
    使用代理池发送请求，失败自动重试

    参数:
        url: 目标网址
        max_retries: 最大重试次数
    返回:
        Response 对象或 None
    """
    for attempt in range(max_retries):  # 最多重试 max_retries 次
        proxy = random.choice(PROXY_POOL)  # 随机选一个代理
        proxies = {"http": proxy, "https": proxy}

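        try:
            resp = requests.get(
                url,
                proxies=proxies,
                headers=get_random_headers(),  # random UA; defined in the previous section
                timeout=10
            )
            resp.raise_for_status()
            return resp  # success: return immediately
        except requests.exceptions.RequestException as e:  # network, proxy, and HTTP errors
            print(f"Attempt {attempt + 1} failed (proxy: {proxy}): {e}")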

    print("所有重试均失败")
    return None

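Legal and ethical considerations

Check robots.txt first: visit <site domain>/robots.txt and follow its Disallow and Crawl-delay rules
Prefer official APIs: NCBI E-utilities, the GitHub API, and the UniProt API are all official interfaces
Control your request rate: NCBI asks for no more than 3 requests per second (up to 10 per second with an API key)
Do not collect personal data: no user email addresses, phone numbers, or other personal information
Do not scrape paid or copyrighted content: no full-text articles behind a subscription
Cite your data sources: state in papers and reports where the data came from
Academic use only: not for commercial or malicious purposes
Follow the site's ToS: read the Terms of Service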

Common errors and fixes

1. ConnectionError: connection failed

requests.exceptions.ConnectionError: Max retries exceeded

Cause: network problems, a wrong URL, or a blocked IP

Fix:

# Check the network → check the URL spelling → add a retry policy
import requests                             # needed for requests.Session
from requests.adapters import HTTPAdapter   # transport adapter
from urllib3.util.retry import Retry        # retry policy

session = requests.Session()
retry = Retry(total=3, backoff_factor=1)    # retry up to 3 times with increasing delays
adapter = HTTPAdapter(max_retries=retry)    # create the adapter
session.mount("http://", adapter)           # mount it on the session
session.mount("https://", adapter)

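2. Timeout: request timed out

requests.exceptions.ReadTimeout: Read timed out

Cause: the server is slow to respond or the network is poor

Fix: increase the timeout, or wrap the call in try-except and retry

response = requests.get(url, timeout=(5, 30))  # connect timeout 5 s, read timeout 30 s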
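
The "try-except and retry" suggestion could look like the sketch below: a generic wrapper that retries a callable with exponentially growing pauses. The function name `retry` and its parameters are invented for this example. To keep it runnable offline, the demo uses a fake flaky function and a no-op `sleep`; with requests you might pass `func=lambda: requests.get(url, timeout=(5, 30))` and `exceptions=(requests.exceptions.Timeout,)`.

```python
import time


def retry(func, max_attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... the base delay
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            sleep(delay)


# --- Demo with a fake function that times out twice, then succeeds ---
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky, max_attempts=5, exceptions=(TimeoutError,), sleep=lambda s: None)
print(result, "after", calls["n"], "calls")
```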

3. HTTPError 403: forbidden

requests.exceptions.HTTPError: 403 Client Error: Forbidden

Cause: no User-Agent was set, or an anti-scraping mechanism blocked the request

Fix: add a User-Agent header and lower the request rate

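4. UnicodeDecodeError: encoding error

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4

Cause: the page is not UTF-8 encoded. The exception usually appears when you decode response.content yourself; response.text replaces undecodable bytes and shows garbled characters instead.

Fix:

response.encoding = response.apparent_encoding  # detect the encoding automatically
# or set it by hand
response.encoding = "gbk"  # Chinese sites may use GBK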
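
When you work with raw bytes (for example `response.content`), a fallback chain of encodings is one way to handle this. The helper `decode_body` below is a sketch made up for this tutorial; the order of candidate encodings is an assumption you would adapt to the sites you scrape, and a byte sequence can occasionally be valid in more than one encoding.

```python
def decode_body(raw: bytes, encodings=("utf-8", "gbk")) -> str:
    """Try each encoding in turn; fall back to UTF-8 with replacement characters."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding does not fit; try the next one
    return raw.decode("utf-8", errors="replace")  # last resort: never raises


# Simulate a GBK-encoded response body
raw = "宏基因组".encode("gbk")
print(decode_body(raw))
```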

5. AttributeError: NoneType

AttributeError: 'NoneType' object has no attribute 'text'

Cause: find() found no matching element and returned None

Fix: check for None before using the result

tag = soup.find("h1")
text = tag.text if tag else "not found"  # use a default when tag is None

6. JSONDecodeError: JSON parsing failed

json.decoder.JSONDecodeError: Expecting value

Cause: the response is not JSON (it may be an HTML error page)

Fix:

if response.headers.get("Content-Type", "").startswith("application/json"):
    data = response.json()   # parse only after confirming it is JSON
else:
    print(f"Response is not JSON: {response.text[:200]}")  # show the first 200 characters for debugging
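
Some servers send JSON with an unexpected Content-Type, so an alternative is to attempt the parse and catch the failure. The helper name `parse_json_or_none` is invented for this sketch; it works on a plain string such as `response.text`.

```python
import json


def parse_json_or_none(text: str):
    """Return the parsed JSON value, or None (with a short diagnostic) if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Show the parser's message and the start of the body to help debugging
        print(f"not valid JSON ({exc.msg}); body starts with: {text[:60]!r}")
        return None


print(parse_json_or_none('{"count": 2, "idlist": ["12345", "67890"]}'))
print(parse_json_or_none("<html><body>502 Bad Gateway</body></html>"))
```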

Cheat sheet

Common requests operations

Operation	Code
GET request	`requests.get(url)`
POST request	`requests.post(url, data=dict)`
With headers	`requests.get(url, headers=dict)`
With query parameters	`requests.get(url, params=dict)`
Set a timeout	`requests.get(url, timeout=10)`
Use a proxy	`requests.get(url, proxies=dict)`
Status code	`response.status_code`
Body as text	`response.text`
Body as JSON	`response.json()`
Body as bytes	`response.content`
Check for errors	`response.raise_for_status()`

Common BeautifulSoup operations

Operation	Code
Create the object	`soup = BeautifulSoup(html, "lxml")`
Find the first match	`soup.find("tag", class_="name")`
Find all matches	`soup.find_all("tag", class_="name")`
CSS selector	`soup.select("div.class > p")`
Get the text	`tag.text` or `tag.get_text()`
Get an attribute	`tag.get("href")` or `tag["href"]`
Find by id	`soup.find(id="my_id")`
Find by attribute	`soup.find("div", attrs={"data-type": "article"})`

NCBI E-utilities API

Utility	Purpose	Endpoint
ESearch	Search and return a list of IDs	`/entrez/eutils/esearch.fcgi`
EFetch	Fetch full records	`/entrez/eutils/efetch.fcgi`
ESummary	Fetch summary information	`/entrez/eutils/esummary.fcgi`
EInfo	Inspect database information	`/entrez/eutils/einfo.fcgi`

Base URL: https://eutils.ncbi.nlm.nih.gov
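
The base URL and endpoints combine with query parameters into a full request URL. The sketch below only builds the URL with `urllib.parse.urlencode` and sends nothing; the helper name `eutils_url` is made up for this example, and the parameters mirror the ESearch call used earlier in this tutorial.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov"
ENDPOINTS = {
    "esearch": "/entrez/eutils/esearch.fcgi",
    "efetch": "/entrez/eutils/efetch.fcgi",
    "esummary": "/entrez/eutils/esummary.fcgi",
    "einfo": "/entrez/eutils/einfo.fcgi",
}


def eutils_url(tool: str, **params) -> str:
    """Build the full URL for one E-utilities call (no request is sent)."""
    return f"{EUTILS_BASE}{ENDPOINTS[tool]}?{urlencode(params)}"


search_url = eutils_url(
    "esearch", db="pubmed", term="CRISPR metagenomics", retmode="json", retmax=5
)
print(search_url)
```

Passing the same dictionary as `params=` to `requests.get` produces an equivalent URL, so building it by hand is mainly useful for logging and debugging.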

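Further resources

Official documentation

requests documentation: https://docs.python-requests.org/
BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
NCBI E-utilities documentation: https://www.ncbi.nlm.nih.gov/books/NBK25497/
GitHub REST API documentation: https://docs.github.com/en/rest

Directions for further study

Direction	Tools	When to use
Dynamic pages	Selenium / Playwright	Pages rendered by JavaScript
Large-scale crawling	Scrapy framework	Crawling a large number of pages
Asynchronous requests	aiohttp / httpx	High concurrency
Bioinformatics-specific	Biopython Entrez	Wrapper library for NCBI data retrieval
Data pipelines	pandas + SQLAlchemy	Data cleaning and storage

The Biopython Entrez module (recommended alternative)

from Bio import Entrez  # Biopython's wrapper around the NCBI E-utilities

# Set your email (NCBI asks for one so it can contact you about problems)
Entrez.email = "your_email@example.com"

# Search PubMed
handle = Entrez.esearch(db="pubmed", term="metagenomics T2D", retmax=5)
record = Entrez.read(handle)   # parse the result
handle.close()                 # close the handle once it has been read
pmids = record["IdList"]       # list of PMIDs
print(f"PMIDs: {pmids}")

# Fetch article details
handle = Entrez.efetch(db="pubmed", id=pmids, rettype="abstract", retmode="xml")
records = Entrez.read(handle)  # parse the XML result
handle.close()

In plain terms: if you mostly work with NCBI, Biopython's Entrez module is more convenient than writing requests calls yourself. It wraps the API calls, so you only need to call functions.

Suggested learning path

Step 1: get the basic requests + BeautifulSoup examples running
           ↓
Step 2: query PubMed / NCBI Gene through the E-utilities API
           ↓
Step 3: learn data storage (CSV / JSON / SQLite)
           ↓
Step 4: understand anti-scraping measures and legal compliance
           ↓
Step 5 (advanced): learn the Scrapy framework, or Selenium for dynamic pages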