Tuesday, June 02, 2026

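Bayesian Hierarchical Models (BHM) and High-Dimensional Data (Transcriptomics, Spatial Statistics)

The history of the Bayesian hierarchical model (BHM) is a comeback story: from a fringe theory within statistics to a core tool of modern data science and bioinformatics. Its rise and spread closely tracked the revolution in computing power and the explosion of high-dimensional data such as transcriptomics and spatial statistics.

What follows traces the development of the BHM and its wide range of modern applications.

⏳ History of the Bayesian hierarchical model

The evolution of the BHM can be divided roughly into four milestone stages: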

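1. Foundations: the James-Stein estimator and shrinkage theory (1950s-1960s)

Although Thomas Bayes formulated Bayes' theorem in the 18th century, the statistical foundations of hierarchical models did not appear until the mid-20th century.

  • Key breakthrough: in 1961, the statisticians Charles Stein and Willard James published a result that startled the field (the James-Stein paradox). They showed that when estimating three or more independent normal means, shrinking the individual estimates toward a common point gives a strictly smaller expected total squared error than the traditional maximum likelihood estimate (MLE). When the common point is the grand mean estimated from the data, four or more means are needed.

  • Historical significance: this overturned the traditional "every estimate for itself" view of independent estimation and gave mathematical legitimacy to the core idea of hierarchical models, which is sharing information across units.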
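The shrinkage result above can be checked with a small simulation. The sketch below is illustrative only: it assumes unit-variance normal observations, uses the positive-part variant that shrinks toward the grand mean (which needs at least four means), and the function names `james_stein` and `compare_risk` are made up for this example.

```python
import random


def james_stein(y, sigma2=1.0):
    """Positive-part James-Stein estimate that shrinks toward the grand mean.

    Assumes y[i] ~ Normal(theta[i], sigma2) with known sigma2 and len(y) >= 4.
    """
    p = len(y)
    grand_mean = sum(y) / p
    spread = sum((v - grand_mean) ** 2 for v in y)
    if spread == 0:
        return [grand_mean] * p
    # (p - 3) rather than (p - 2): one degree of freedom is spent on the mean.
    factor = max(0.0, 1.0 - (p - 3) * sigma2 / spread)
    return [grand_mean + factor * (v - grand_mean) for v in y]


def compare_risk(theta, n_rep=2000, seed=0):
    """Average total squared error of the MLE (y itself) and of James-Stein."""
    rng = random.Random(seed)
    mle_err = 0.0
    js_err = 0.0
    for _ in range(n_rep):
        y = [rng.gauss(t, 1.0) for t in theta]
        est = james_stein(y)
        mle_err += sum((a - t) ** 2 for a, t in zip(y, theta))
        js_err += sum((a - t) ** 2 for a, t in zip(est, theta))
    return mle_err / n_rep, js_err / n_rep


if __name__ == "__main__":
    true_means = [0.0, 0.5, -0.5, 1.0, -1.0, 0.2, 0.8, -0.3, 0.1, -0.7]
    mle_risk, js_risk = compare_risk(true_means)
    print(f"MLE risk: {mle_risk:.2f}  James-Stein risk: {js_risk:.2f}")
```

The gain is largest when the true means are close together relative to the noise; when they are far apart the shrinkage factor approaches 1 and the two estimators nearly coincide.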

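2. Theory takes shape: the Bayesian framework arrives (1970s-1980s)

Statisticians then found that the James-Stein estimator has a natural interpretation within the Bayesian framework.

  • Key breakthrough: in the 1970s, Bradley Efron (inventor of the bootstrap) and Carl Morris published a series of papers that formally linked shrinkage theory to Bayesian priors and showed how hyperparameters can be used to build multilevel statistical structures.

  • Historical significance: the theoretical frameworks of hierarchical models and empirical Bayes took shape, and statisticians began to recognize how much this structure could offer for clustered data.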

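3. Computational revolution: MCMC algorithms and the birth of BUGS (1990s)

Although the framework was in place by the 1980s, BHMs faced a critical bottleneck: the posterior distribution could not be computed in practice. The high-dimensional, multilevel integrals generally have no closed-form solution, which kept most models at the theoretical stage.

  • Computing breakthrough: in 1990, Alan Gelfand and Adrian Smith brought Markov chain Monte Carlo (MCMC), and Gibbs sampling in particular, into Bayesian statistics, turning intractable high-dimensional integrals into random sampling on a computer.

  • Software: the BUGS (Bayesian inference Using Gibbs Sampling) project followed, finally giving scientists an engineering tool to actually run BHMs. From then on, practical use of BHMs grew rapidly.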
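To make the idea of replacing integrals with sampling concrete, here is a minimal Gibbs sampler for the simplest hierarchical model, y[j] ~ N(theta[j], sigma2) with theta[j] ~ N(mu, tau2). It is a sketch under simplifying assumptions, not how BUGS is implemented: both variances are treated as known and mu has a flat prior, so each conditional distribution is normal. The function name is hypothetical.

```python
import random


def gibbs_normal_hierarchy(y, sigma2, tau2, n_iter=4000, burn_in=1000, seed=1):
    """Gibbs sampler for y[j] ~ N(theta[j], sigma2), theta[j] ~ N(mu, tau2).

    sigma2 and tau2 are assumed known and mu has a flat prior.
    Returns (posterior mean of each theta[j], posterior mean of mu).
    """
    rng = random.Random(seed)
    n = len(y)
    mu = sum(y) / n
    # Conditional variance of theta[j] given mu and y[j].
    post_var = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    theta_sum = [0.0] * n
    mu_sum = 0.0
    kept = 0
    for it in range(n_iter):
        # theta[j] | mu, y: precision-weighted mix of the observation and mu.
        theta = [
            rng.gauss(post_var * (yj / sigma2 + mu / tau2), post_var ** 0.5)
            for yj in y
        ]
        # mu | theta: centered on the mean of theta with variance tau2 / n.
        mu = rng.gauss(sum(theta) / n, (tau2 / n) ** 0.5)
        if it >= burn_in:
            kept += 1
            mu_sum += mu
            theta_sum = [s + t for s, t in zip(theta_sum, theta)]
    return [s / kept for s in theta_sum], mu_sum / kept


if __name__ == "__main__":
    theta_mean, mu_mean = gibbs_normal_hierarchy([-2.0, 0.0, 1.0, 5.0], 1.0, 1.0)
    print(theta_mean, mu_mean)
```

With sigma2 = tau2, each posterior mean should sit roughly halfway between its own observation and the overall mean, which is the shrinkage behavior described in the earlier sections.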

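4. Modern boom: high-dimensional data and probabilistic programming languages (2000s to present)

In the 21st century, with the rise of genomics, big data, and artificial intelligence, data sets with far more features (for example, tens of thousands of genes) than samples became the norm.

  • Modern methods: traditional Gibbs sampling tends to mix slowly in high-dimensional spaces, so researchers developed Hamiltonian Monte Carlo (HMC), which borrows from Hamiltonian mechanics, and its adaptive variant, the No-U-Turn Sampler (NUTS).

  • Ecosystem: this gave rise to a new generation of probabilistic programming languages (PPLs), such as PyMC (used in the code in this post), Stan, and Pyro. A high-dimensional hierarchical model can now be specified in a few lines of code, and inference that once took days can often finish in minutes.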

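🌍 Modern applications of the Bayesian hierarchical model

Because BHMs are well suited to data with a nested structure (genes within cells, students within schools, individuals within regions) and to small-sample, high-noise problems, they are widely used across modern science:

1. Bioinformatics and genomics

As in the transcriptomic analysis in this post, BHMs are one of the gold standards for multi-omics data analysis.

  • Differential expression analysis: with small samples (such as $n=3$), the genome-wide background of all genes is used to constrain each gene's variance estimate. The core of classic tools such as limma and DESeq2 is essentially an empirical Bayes hierarchical model.

  • Single-cell RNA sequencing (scRNA-seq): single-cell data contain a very high proportion of zeros (dropout). By modeling both cell populations and gene groups as levels, BHMs can support robust imputation and denoising of these values.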
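The variance-sharing idea behind empirical Bayes differential expression can be sketched in a few lines. The moderated variance below is the standard precision-weighted form (prior_df * prior_var + df * s2) / (prior_df + df). In real tools the prior variance and prior degrees of freedom are estimated from all genes; here they are simply passed in as assumed values, and the function names are hypothetical rather than any package's API.

```python
def moderated_variances(sample_vars, df_resid, prior_var, prior_df):
    """Shrink each gene's sample variance toward a shared prior variance."""
    return [
        (prior_df * prior_var + df_resid * s2) / (prior_df + df_resid)
        for s2 in sample_vars
    ]


def moderated_t(mean_diff, sample_var, n, prior_var, prior_df):
    """t-like statistic for a paired mean difference using the shrunken variance."""
    s2 = moderated_variances([sample_var], n - 1, prior_var, prior_df)[0]
    return mean_diff / (s2 / n) ** 0.5


if __name__ == "__main__":
    # Three genes with n = 3 pairs (2 residual degrees of freedom each).
    print(moderated_variances([0.01, 1.0, 4.0], 2, prior_var=1.0, prior_df=4.0))
    # A gene with a suspiciously tiny variance no longer yields an extreme t value.
    print(moderated_t(1.0, 0.01, 3, prior_var=1.0, prior_df=4.0))
```

With only two residual degrees of freedom per gene, the prior dominates, so genes whose variance looks tiny by chance are pulled back toward the typical value.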

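2. Epidemiology and spatial disease mapping

In public health, researchers need to estimate disease incidence across administrative districts.

  • Small area estimation: some remote towns have very small populations, so a single cancer case can make the observed incidence rate spike.

  • Solution: epidemiologists commonly use the BYM model (Besag-York-Mollié, a spatial Bayesian hierarchical model), which shrinks each town's rate toward both the county-wide average (global) and the rates of neighboring towns (spatial correlation). This removes much of the statistical noise caused by small populations and produces more reliable disease risk maps.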
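The global part of this shrinkage can be illustrated with a Poisson-gamma model, in which the posterior mean rate is (a + cases) / (b + population). This is a deliberately simplified sketch: it omits the spatial (neighbor) component that makes BYM what it is, and the prior rate and prior strength are assumed inputs rather than estimated quantities. The function name is hypothetical.

```python
def shrunken_rate(cases, population, prior_rate, prior_strength):
    """Posterior mean incidence under a gamma prior on a Poisson rate.

    The prior is Gamma(shape=prior_rate * prior_strength, rate=prior_strength),
    so prior_strength acts like a number of "prior person-units" of exposure.
    """
    shape = prior_rate * prior_strength
    return (shape + cases) / (prior_strength + population)


if __name__ == "__main__":
    # A town of 500 with one case: raw rate 0.002, four times the prior rate.
    print(shrunken_rate(1, 500, prior_rate=0.0005, prior_strength=10_000))
    # A city of one million with 800 cases barely moves from its raw rate.
    print(shrunken_rate(800, 1_000_000, prior_rate=0.0005, prior_strength=10_000))
```

The small town is pulled most of the way back to the prior rate, while the large city keeps an estimate close to its observed rate.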

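3. Ecology and environmental science

Wildlife surveys often take place under very difficult observation conditions.

  • Population dynamics of animals and plants: researchers set up infrared camera traps in the field, and the resulting observations are affected by weather, terrain, camera failures, and other disturbances.

  • Solution: in a hierarchical model, the first level describes where the animals actually are (the latent state) and the second level describes the probability that a camera detects an animal that is present (the observation process). This lets a BHM separate the true biological state from observation noise.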
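The two levels described above correspond to a basic site-occupancy model. The sketch below writes out its likelihood for one site, assuming a constant occupancy probability `psi` and a constant per-visit detection probability `p`; both names and both functions are illustrative assumptions, and real analyses usually let these depend on covariates.

```python
def site_likelihood(detections, visits, psi, p):
    """Likelihood of one site's detection history (for a specific visit order).

    State level: the site is occupied with probability psi.
    Observation level: if occupied, each visit detects with probability p.
    """
    occupied_part = psi * p ** detections * (1 - p) ** (visits - detections)
    # An unoccupied site can only produce an all-zero history.
    absent_part = (1 - psi) if detections == 0 else 0.0
    return occupied_part + absent_part


def prob_occupied_given_no_detection(visits, psi, p):
    """Posterior probability that a site is occupied despite zero detections."""
    missed = psi * (1 - p) ** visits
    return missed / (missed + (1 - psi))


if __name__ == "__main__":
    # Three empty camera checks do not prove absence.
    print(prob_occupied_given_no_detection(visits=3, psi=0.5, p=0.5))
```

Separating `psi` from `p` is what lets the model distinguish "the animal is not there" from "the animal is there but the camera missed it".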

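4. Clinical trials and meta-analysis

In drug development, researchers need to synthesize clinical reports published by dozens of centers worldwide with differing sample sizes.

  • Meta-analysis: every clinical trial (study) has its own particularities, such as country or age range.

  • Solution: a BHM treats each trial as a second-level unit and the overall effect as the first-level hyperprior. It estimates between-study heterogeneity explicitly, and even when some small trials are biased or noisy, the model shrinks them in proportion to their uncertainty, giving a more balanced assessment of efficacy.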
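A plug-in version of this weighting can be written directly. The sketch below assumes the between-study variance `tau2` is already known (a full BHM would give it a prior and estimate it), pools the studies with weights 1 / (se² + tau2), and shrinks each study toward the pooled effect by the factor se² / (se² + tau2). The function name is hypothetical.

```python
def random_effects_shrinkage(estimates, std_errors, tau2):
    """Pooled effect and shrunken study effects for a random-effects model.

    tau2 (between-study variance) is treated as known in this sketch.
    """
    weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    shrunk = []
    for y, se in zip(estimates, std_errors):
        # Noisier studies (large se) are pulled harder toward the pooled effect.
        b = se ** 2 / (se ** 2 + tau2)
        shrunk.append(b * pooled + (1 - b) * y)
    return pooled, shrunk


if __name__ == "__main__":
    # A precise large trial (0.2 ± 0.1) and a noisy small trial (0.8 ± 0.5).
    pooled, shrunk = random_effects_shrinkage([0.2, 0.8], [0.1, 0.5], tau2=0.04)
    print(pooled, shrunk)
```

In this example the small trial's estimate moves most of the way toward the pooled effect, while the large trial's estimate changes only slightly.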

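5. Finance, marketing, and business decisions

In retail, companies need to forecast future sales for thousands of products or evaluate the effect of advertising in different regions.

  • Hierarchical marketing mix modeling (hierarchical MMM): consumers are grouped by city and age band. A BHM learns the behavior shared by consumers nationwide (the hyperparameters) while still capturing the preferences specific to a given city (Taipei versus Kaohsiung, for example). This works particularly well for forecasting in segments with sparse data.

📝 Summary of history and applications

From the conceptual revolution set off by the James-Stein paradox to the computational revolution set off by MCMC, the BHM has become a mainstay across fields because it reflects a shift in how we think about data: we no longer treat each observed unit in isolation but place it within a larger system. This philosophy of sharing information while preserving differences makes it one of the most capable statistical tools for complex, noisy, small-sample data.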

#!/usr/bin/env python3

# -*- coding: utf-8 -*-

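r"""
Created on Wed Jun  3 09:15:55 2026

This script is built on a Bayesian hierarchical model and addresses a typical
biostatistics problem: transcriptome feature selection with few samples and
many genes.

Bayesian inference is well suited to feature selection in this setting.
Shrinkage estimation or hierarchical models share variance information across
genes, which makes the inference more robust.

Running the script produces a results table. The selection logic is:

1. Core metric: posterior probability.
   Prob_UP (AT>MT) is the posterior probability that expression increased
   after treatment. A value of 0.97 means that, under the model, there is a
   97% posterior probability that the gene was up-regulated. Prob_DOWN (AT<MT)
   is the posterior probability that expression decreased. In the regression
   version below, the corresponding columns are Prob_Phenotype_Pos(Beta>0) and
   Prob_Phenotype_Neg(Beta<0).

2. Bayesian effect size.
   With only 3 pairs of samples, a large mean difference may be caused by an
   extreme value (outlier) in a single eel. Bayesian_Effect_Size accounts for
   variability ($\mu / \sigma$). The larger its absolute value (for example
   $> 1.5$ or $< -1.5$), the larger the expression change relative to its
   fluctuation, and the more strongly the gene is associated with differences
   in developmental stage.

3. Linking to differences in ovarian development.
   If an ovarian maturity score was recorded for each of the three eels after
   treatment (for example Fish 1: 80%, Fish 2: 50%, Fish 3: 20%), the model
   can be upgraded to a Bayesian linear regression. Genes whose expression
   change is positively or negatively associated with ovarian development can
   then be selected by checking whether the posterior distribution of the
   slope beta_1 lies far from 0.

The per-gene version of this code ran into two bottlenecks that are common in
high-dimensional transcriptome analysis:

- Divergences and $\hat{R} > 1.01$: each gene has only 3 paired samples, so a
  model fitted to each gene independently cannot estimate that gene's sigma
  reliably from 3 points. The resulting funnel-shaped posterior geometry (as
  in Neal's funnel) makes the MCMC sampler report many divergences, and the
  results cannot be trusted.
- Very low efficiency: the expression matrix contains 21,651 genes. Calling
  pm.sample() once per gene in a `for gene in genes:` loop could take tens of
  hours or even days.

Solution: a Bayesian hierarchical model. All genes are placed in a single
vectorized model. Because every gene shares the same hyperpriors, the model
performs shrinkage estimation automatically: in plain terms, genes with stable
expression help correct the variance estimates of noisy genes. As a result:

- divergences caused by the small sample size are greatly reduced;
- sampling is far faster, because 20,000+ genes are sampled together in one
  run through array operations, potentially in minutes.

@author: yshuang
"""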

import time
from pathlib import Path
from typing import Tuple

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm


# ============================================================
# Configuration (all paths and parameters in one place)
# ============================================================
class Config:
    """Central configuration so that values are not hard-coded throughout the script."""

    # Paths
    INPUT_FILE = Path("/Users/yshuang/Documents/Python/FC_GSEA.xlsx")
    OUTPUT_DIR = Path("/Users/yshuang/Documents/Python")
    OUTPUT_CSV = OUTPUT_DIR / "Bayesian_Phenotype_Selection_Results.csv"
    OUTPUT_RNK = OUTPUT_DIR / "Eel_Ovary_AT_vs_MT_Phenotype_Bayesian.rnk"

    # Data
    SHEET_NAME = "data (TPM)"
    USE_COLS = "A, E:J"
    N_SAMPLES_PER_GROUP = 3  # samples per group

    # Quantitative phenotype change per individual (AT - MT).
    # Developmental-progress weights follow the ordering 1440 > 2613 >> 2003.
    # Each key must match the individual ID embedded in the Excel sample names.
    PHENOTYPE_DELTA_MAP = {
        "1440": 3.0,   # strong
        "2613": 2.0,   # moderate
        "2003": 0.5    # weak
    }

    # MCMC
    DRAWS = 1000
    TUNE = 1000
    CHAINS = 2
    TARGET_ACCEPT = 0.95
    RANDOM_SEED = 42

    # Feature-selection threshold (posterior probability that the slope is positive or negative)
    PROB_THRESHOLD = 0.95
    TOP_N_DISPLAY = 10



def load_and_preprocess_data(config: Config) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, np.ndarray, np.ndarray, list]:
    """
    Load the transcriptome data and extract the phenotype change values,
    aligned strictly to the individuals.
    """
    if not config.INPUT_FILE.exists():
        raise FileNotFoundError(f"❌ Input file not found: {config.INPUT_FILE}")

    print(f"📂 Reading data: {config.INPUT_FILE}")

    df = pd.read_excel(
        config.INPUT_FILE,
        sheet_name=config.SHEET_NAME,
        index_col=0,
        usecols=config.USE_COLS
    )
    df.columns = df.columns.str.strip()

    # Orient the matrix as samples x genes
    if df.shape[1] == config.N_SAMPLES_PER_GROUP * 2:
        Y_raw = df.T.astype(float)
    else:
        Y_raw = df.astype(float)

    print(f"✅ Expression matrix loaded: {Y_raw.shape[0]} samples × {Y_raw.shape[1]} genes")

    # arcsinh behaves like a log transform for large values and is defined at zero
    Y_asinh = np.arcsinh(Y_raw)

    mt_samples = sorted([col for col in Y_asinh.index if 'MT' in col.upper()])
    at_samples = sorted([col for col in Y_asinh.index if 'AT' in col.upper()])

    if len(mt_samples) != len(at_samples) or len(mt_samples) == 0:
        raise ValueError("❌ Sample pairing failed; check the MT and AT sample names.")

    print(f"📊 Paired samples: control {mt_samples} ↔️ treatment {at_samples}")

    # Paired differences (treatment minus control)
    data_mt = Y_asinh.loc[mt_samples].values
    data_at = Y_asinh.loc[at_samples].values
    diff_matrix = data_at - data_mt
    X_data = diff_matrix.T  # (n_genes, n_samples)
    genes = Y_raw.columns.tolist()

    # Parse each sample name and look up its quantitative phenotype value
    phenotype_list = []
    for mt_name, at_name in zip(mt_samples, at_samples):
        matched = False
        for key, val in config.PHENOTYPE_DELTA_MAP.items():
            if key in mt_name:
                # Pairing relies on sort order, so confirm both samples share the same individual ID
                if key not in at_name:
                    raise ValueError(f"❌ Samples {mt_name} and {at_name} do not belong to the same individual ({key}).")
                phenotype_list.append(val)
                matched = True
                break
        if not matched:
            raise ValueError(f"❌ Sample {mt_name} matches no phenotype entry; check Config.PHENOTYPE_DELTA_MAP")

    X_phenotype = np.array(phenotype_list)
    print(f"📈 Phenotype change aligned to samples (ΔPhenotype): {dict(zip(mt_samples, X_phenotype))}")

    diff_df = pd.DataFrame(diff_matrix, index=mt_samples, columns=genes)

    return Y_raw, Y_asinh, diff_df, X_data, X_phenotype, genes



def build_bayesian_regression_model(X_data: np.ndarray, X_phenotype: np.ndarray, config: Config) -> az.InferenceData:
    """
    Build and sample a Bayesian hierarchical linear regression that couples
    the per-gene expression differences (Y) to the phenotype change (X).
    """
    n_genes, n_samples = X_data.shape

    print("🚀 Starting MCMC sampling for the hierarchical regression...")
    start_time = time.time()

    with pm.Model() as hierarchical_regression_model:
        # --- 1. Hierarchical prior on the intercepts (non-centered) ---
        alpha_global = pm.Normal("alpha_global", mu=0, sigma=2)
        sigma_alpha = pm.HalfNormal("sigma_alpha", sigma=2)
        alpha_offset = pm.Normal("alpha_offset", mu=0, sigma=1, shape=n_genes)
        alpha_genes = pm.Deterministic("alpha_genes", alpha_global + alpha_offset * sigma_alpha)

        # --- 2. Hierarchical prior on the slopes (sensitivity to the phenotype) ---
        # beta_global is the average slope across the whole transcriptome
        beta_global = pm.Normal("beta_global", mu=0, sigma=2)
        sigma_beta = pm.HalfNormal("sigma_beta", sigma=2)
        beta_offset = pm.Normal("beta_offset", mu=0, sigma=1, shape=n_genes)

        # beta_genes is each gene's own response slope to the phenotype change
        beta_genes = pm.Deterministic("beta_genes", beta_global + beta_offset * sigma_beta)

        # --- 3. Residual standard deviation shared by all genes ---
        sigma_noise = pm.HalfNormal("sigma_noise", sigma=2)

        # --- 4. Linear predictor: mu = baseline difference + slope * phenotype change ---
        mu_predicted = alpha_genes[:, None] + beta_genes[:, None] * X_phenotype[None, :]

        # Likelihood
        likelihood = pm.Normal(
            "y_obs",
            mu=mu_predicted,
            sigma=sigma_noise,
            observed=X_data
        )

        # Sample
        trace = pm.sample(
            draws=config.DRAWS,
            tune=config.TUNE,
            chains=config.CHAINS,
            target_accept=config.TARGET_ACCEPT,
            return_inferencedata=True,
            random_seed=config.RANDOM_SEED,
            progressbar=True
        )

    elapsed = time.time() - start_time
    print(f"🎉 MCMC finished. Elapsed: {elapsed:.1f} s")

    return trace



def extract_posterior_statistics(
    trace: az.InferenceData,
    X_data: np.ndarray,
    genes: list,
    config: Config
) -> pd.DataFrame:
    """Extract per-gene metrics from the regression posterior, centered on the slope beta."""
    print("📊 Computing phenotype-related posterior statistics...")

    # Posterior draws with chains stacked: shape (n_genes, n_draws) for the per-gene terms
    beta_samples = az.extract(trace, var_names="beta_genes").values
    alpha_samples = az.extract(trace, var_names="alpha_genes").values
    sigma_noise_samples = az.extract(trace, var_names="sigma_noise").values

    # Summary statistics
    mean_diff = X_data.mean(axis=1)
    post_beta_mean = beta_samples.mean(axis=1)
    post_alpha_mean = alpha_samples.mean(axis=1)

    # Posterior probability that the gene is positively or negatively associated with the phenotype change
    prob_up = np.mean(beta_samples > 0, axis=1)
    prob_down = np.mean(beta_samples < 0, axis=1)

    # Bayesian effect size: posterior mean of slope / residual SD (a signal-to-noise ratio).
    # sigma_noise is shared by all genes, so it broadcasts across the gene axis.
    epsilon = 1e-10
    effect_size = np.mean(beta_samples / (sigma_noise_samples[None, :] + epsilon), axis=1)

    df_results = pd.DataFrame({
        "Raw_Mean_Diff": mean_diff,
        "Bayesian_Intercept(Alpha)": post_alpha_mean,
        "Bayesian_Slope(Beta)": post_beta_mean,
        "Prob_Phenotype_Pos(Beta>0)": prob_up,
        "Prob_Phenotype_Neg(Beta<0)": prob_down,
        "Bayesian_Effect_Size": effect_size
    }, index=genes)

    # Feature selection: the slope's direction is supported above the probability threshold
    df_results["Selected"] = (
        (df_results["Prob_Phenotype_Pos(Beta>0)"] > config.PROB_THRESHOLD) |
        (df_results["Prob_Phenotype_Neg(Beta<0)"] > config.PROB_THRESHOLD)
    )

    # Sort by absolute effect size
    df_results = df_results.sort_values(
        by="Bayesian_Effect_Size",
        key=lambda x: np.abs(x),
        ascending=False
    )

    print(f"\n--- Top {config.TOP_N_DISPLAY} genes most strongly associated with the phenotype ---")
    print(df_results.head(config.TOP_N_DISPLAY).to_string())

    n_selected = df_results["Selected"].sum()
    print(f"\n📈 Feature selection summary: {n_selected}/{len(genes)} genes strongly coupled to the phenotype trajectory (threshold {config.PROB_THRESHOLD*100:.0f}%)")

    return df_results



def export_gsea_ranklist(df_results: pd.DataFrame, config: Config) -> None:
    """Export a GSEA .rnk file ranked by the phenotype-regression effect size."""
    print("\n🚚 Building the phenotype-driven GSEA rank list...")
    config.OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    df_gsea = df_results[["Bayesian_Effect_Size"]].copy().dropna()
    # GSEA expects unique gene identifiers; keep the first occurrence
    if df_gsea.index.duplicated().any():
        df_gsea = df_gsea[~df_gsea.index.duplicated(keep="first")]

    df_gsea = df_gsea.reset_index()
    df_gsea.columns = ["Gene", "Score"]
    df_gsea = df_gsea.sort_values(by="Score", ascending=False).reset_index(drop=True)

    df_gsea.to_csv(config.OUTPUT_RNK, sep="\t", header=False, index=False)
    print(f"✅ GSEA .rnk exported: {config.OUTPUT_RNK}")



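def run_diagnostics(trace: az.InferenceData) -> None:
    """Convergence diagnostics for the global hyperparameters."""
    print("\n🔍 Convergence diagnostics...")
    summary = az.summary(trace, var_names=["alpha_global", "beta_global", "sigma_noise"])
    print("\n--- Convergence summary for the global hyperparameters ---")
    print(summary[["mean", "sd", "r_hat"]].to_string())

    # Report NUTS divergences, since a nonzero count signals unreliable sampling
    n_divergent = int(trace.sample_stats["diverging"].sum())
    print(f"\nDivergent transitions: {n_divergent}")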



def main():
    config = Config()
    try:
        # 1. Load and preprocess the data (including phenotype alignment)
        Y_raw, Y_asinh, diff_df, X_data, X_phenotype, genes = load_and_preprocess_data(config)

        # 2. Fit the Bayesian hierarchical regression
        trace = build_bayesian_regression_model(X_data, X_phenotype, config)

        # 3. Convergence diagnostics
        run_diagnostics(trace)

        # 4. Extract regression statistics
        df_results = extract_posterior_statistics(trace, X_data, genes, config)

        # 5. Save the full results
        df_results.to_csv(config.OUTPUT_CSV)
        print(f"\n💾 Regression results saved: {config.OUTPUT_CSV}")

        # 6. Export the GSEA file
        export_gsea_ranklist(df_results, config)
        print("\n🎊 Phenotype coupling analysis complete!")

    except Exception as e:
        print(f"\n❌ Runtime error: {e}")
        raise



if __name__ == "__main__":

    main()
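For readers who want to see what the selection metrics mean without running PyMC, the per-gene summary in the script can be reproduced on plain lists of posterior draws. This is a simplified sketch: `summarize_slope` is a hypothetical helper, not part of the script, and the draws below are made-up numbers rather than results.

```python
def summarize_slope(beta_draws, sigma_draws, threshold=0.95):
    """Sign probabilities and effect size for one gene from posterior draws.

    beta_draws:  posterior draws of the gene's slope.
    sigma_draws: matching draws of the shared residual SD.
    """
    n = len(beta_draws)
    prob_pos = sum(b > 0 for b in beta_draws) / n
    prob_neg = sum(b < 0 for b in beta_draws) / n
    # Average the ratio draw by draw, as the script does, rather than
    # dividing the mean slope by the mean sigma.
    effect_size = sum(b / s for b, s in zip(beta_draws, sigma_draws)) / n
    return {
        "prob_pos": prob_pos,
        "prob_neg": prob_neg,
        "effect_size": effect_size,
        "selected": max(prob_pos, prob_neg) > threshold,
    }


if __name__ == "__main__":
    # Illustrative draws only.
    print(summarize_slope([0.5, 1.0, 1.5, -0.5], [1.0, 2.0, 1.0, 1.0]))
```

A gene is selected only when nearly all draws agree on the sign of the slope, so a large average effect with mixed signs does not pass the threshold.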

Tuesday, May 26, 2026

Scientists stunned by ‘fundamentally new way’ life produces DNA


Bacterial system uses protein in novel way to build mysterious repetitive DNA sequence that defends against viruses

In a newly discovered bacterial defense system, paired strands of DNA (orange and cyan) are synthesized by two enzymes: One (yellow) uses an RNA template (beige) to guide the assembly of the nucleotide bases that make up DNA, while a second, highly unusual enzyme (light blue) uses its own amino acids as a template. Credit: Hyunbin Lee

For decades, biology textbooks have enshrined a simple rule: DNA is made by copying a template. After one enzyme unzips a DNA double helix into separate strands, another called a polymerase builds a complementary sequence, base by base, for each strand. Presto: two copies of the original DNA. But new research into how bacteria defend themselves from viruses now shows this synthesis rule isn’t absolute. Today in Science, a Stanford University team describes a bacterial enzyme that synthesizes a long repetitive DNA sequence without a nucleic acid template, using its own structure as a guide.

“The research is groundbreaking,” says Philip Kranzusch, a biochemist at Harvard Medical School who studies bacterial defenses. “Pretty cool!” adds Adi Millman, a computational biologist at the Massachusetts Institute of Technology. The use of a protein as a template for DNA synthesis, she says, “is a meaningful conceptual shift from the classical central dogma,” in which information flows in one direction from nucleic acids like DNA to protein. Other biologists take issue with the idea that dogma has been challenged, arguing the mysterious DNA sequence made by the enzyme doesn't then become part of the bacterium's genome. Still, the Stanford team and other scientists hope the novel form of DNA synthesis can be adapted as a tool for basic biological research, much like the powerful genome editor CRISPR was developed from another bacterial defense system.

In canonical DNA replication, the rules of base pairing reign supreme: Polymerases assemble their complementary DNA strand by matching adenine with thymine and guanine with cytosine on the template. Replication can also proceed with RNA as the template, thanks to polymerases called reverse transcriptases that use that nucleic acid to guide the fabrication of single-stranded DNA.

The new finding centers on DRT3, a defense system that protects bacteria from viruses, known as phages, that infect them. DRT3, the researchers found, bypasses the logic of base pairing. It relies on two reverse transcriptases: a conventional one that builds single-stranded DNA from an RNA template, and a second, unusual one that assembles its complement from its own built-in template. This unusual enzyme, called Drt3b, has amino acids in its active site that mimic a template RNA strand.

“The protein itself serves as the blueprint for the DNA sequence,” says Stanford biochemist Alex Gao, senior author on the study. “That was quite a surprise,” he says. “This is a fundamentally new way that life produces DNA.” Yet Gao acknowledges that Drt3b only makes a single, specific repetitive sequence. "It does not represent a general mechanism for proteins to write genetic code," he says.

The DRT3 system appears to be widespread across bacteria, suggesting it is not a biochemical curiosity. Yet how it thwarts phages is still a mystery.

One possibility, Gao says, is that DNA helices made by this unique replication method act as molecular sponges that glom onto phage components, either directly hindering the phage or enabling other bacterial immune elements to recognize the infection. If that idea holds up, Kranzusch says, DRT3 would complement recent discoveries of polymeraselike proteins in other bacterial defense systems that produce nucleic acid polymers to detect and inhibit phage infection.

DRT3 also represents another mind-bending role for reverse transcriptases, long associated with retroviruses such as HIV, which uses one to synthesize a DNA copy of its RNA genome and slip into a cell’s chromosomes. In recent years, these enzymes have been revealed to be key players in some CRISPR bacterial defense systems and in the generation of entirely new bacterial genes. RTs are now appreciated as “highly adaptable scaffolds that have been repeatedly co-opted” for functions beyond DNA replication, Gao says.

Like CRISPR, DRT3 could have practical applications. “DRT3 represents an ‘all-in-one’ molecular machine for sequence-specific DNA synthesis, which is a rare find in nature,” Gao says. If scientists could figure out how to engineer Drt3b to produce other sequences, he adds, they might make customized DNA strands, for instance to create advanced biomaterials such as DNA hydrogels.

More broadly, the discovery underscores how much remains hidden in microbial biology. DRT3, Gao says, should be viewed as “a catalyst to re-examine the dark matter of the microbial world.” And with numerous bacterial defense systems still uncharacterized, adds Aude Bernheim, a microbiologist at the Pasteur Institute, “it’s fantastic to imagine that many of these encode exotic biochemical functions like the one uncovered here.”

Update, 22 April, 10:25 a.m.: This story has been edited to clarify the type of DNA made by the bacterial enzyme and that it doesn’t “rewrite” the genetic code, as noted in an added quote.

Monday, May 04, 2026

Fertility Decline So Fast? The Key Is the Ovary

https://www.ucsf.edu/news/2025/10/430841/why-does-female-fertility-decline-so-fast-key-ovary

Why Does Female Fertility Decline So Fast? The Key Is the Ovary

With a new imaging technique, scientists discover an ecosystem that determines how eggs mature and ovaries age.

By Sarah C.P. Williams

The ticking of the biological clock is especially loud in the ovaries — the organs that store and release a woman’s eggs. From age 25 to 40, a woman’s chance of conceiving each month decreases drastically.

For decades, scientists have pointed to declining egg quality as the main culprit. But new research from UC San Francisco and Chan Zuckerberg Biohub San Francisco shows that the story is bigger than the eggs: The surrounding cells and tissues of the ovary play a crucial role in how eggs mature and how quickly fertility wanes. The work is supported by the National Institutes of Health (NIH).


“We’ve long thought of ovarian aging as simply a problem of egg quality and quantity,” said Diana Laird, PhD, professor of Obstetrics, Gynecology & Reproductive Sciences at UCSF and senior author of the study, which appears in Science on Oct. 9. “What we’ve shown is that the environment around the eggs — the supporting cells, nerves, and connective tissue — is also changing with age.”

Understanding these changes may hold the key not only to extending fertility, but also to improving health. The risks of many age-related diseases rise after menopause or ovary removal, and slowing ovarian aging could help reduce these risks.

“By combining the Laird lab’s cutting-edge imaging with the Biohub’s expertise in two kinds of single-cell sequencing, we were able to understand the ovary in unprecedented detail,” said Norma Neff, PhD, director of the Genomics Platform at the San Francisco Biohub, who collaborated with Laird on the work. “This technology-driven approach let us uncover new cell types, providing a foundation for future discoveries in reproductive health.”

The number of eggs (green) declines with age. Growing eggs are shown in magenta. At left is a 2-month-old mouse. At right is a 12-month-old mouse. Images by Gaylord, et al.



It takes an entire ecosystem to raise an egg

Laird and her colleagues set out to profile what normal aging looks like in the ovaries of mice and humans. First, they developed a new three-dimensional imaging technique that allowed them to visualize eggs in the ovaries without having to slice the organs into thin layers, as had been done before.

In mice that were the equivalent of 30 to 40 human years, they observed a dramatic drop in both immature resting eggs that are waiting in reserve and in growing eggs that are beginning to mature for ovulation. And just like women in their 30s, the mice did not conceive easily with in vitro fertilization (IVF).

When the scientists extended their 3D imaging to human ovaries, they uncovered an unexpected finding: Eggs are not evenly scattered throughout the ovary. Instead, they cluster in “pockets” surrounded by egg-free zones. With age, the density of eggs within these pockets declines.

“This was a surprise. We assumed eggs would be distributed more evenly based on what we see in the developing ovary,” said Laird, who is a Biohub investigator and a member of the Eli and Edythe Broad Center of Regeneration Medicine at UCSF. “These pockets suggest that even within one ovary, the environment around an egg may influence how long it lasts and how well it matures.”

A 3D view of a whole mouse ovary, with every egg marked in green and the growing egg follicles in magenta.


New role for the nervous system in ovarian health

Next, the researchers teamed up with Neff’s group at the Biohub to study what genes were active in ovary cells as they aged. Ovarian tissue from humans is hard to come by, and eggs are large and incredibly fragile. So, instead of using standard miniature devices that separate and tag cells to sequence their active genes, the group painstakingly isolated individual eggs by hand to separate them from other cells.

After studying nearly 100,000 mouse and human cells, they identified 11 major cell types found in the ovaries, including one surprise: Glia, a type of support cell typically associated with nerves and most extensively studied in the brain, were in the ovaries.

At the same time, the study revealed that sympathetic nerves — the same nerves involved in the “fight or flight” response — form dense networks in ovaries that become even more dense with age. When the researchers ablated these nerves in mice, the animals had more eggs in reserve but fewer that matured, suggesting the nerves help decide when eggs start growing. Together, the observations on glia and sympathetic nerves suggest a new role for the nervous system in ovarian health.

A human ovary at age 23 (left) and age 55 (right). Sympathetic nerves, involved in the “fight or flight” response, are shown in white and increase with age. Images by Gaylord, et al.


Other support cells called fibroblasts also changed with age, triggering inflammation and scarring in the ovaries of women in their 50s — years earlier than such scarring appears in organs like the lungs or liver.

“This all points to a brand-new line of inquiry about how nerves, blood vessels, and other cell types communicate with eggs,” Laird said. “It tells us that ovarian aging is not just about the egg cells but about their whole ecosystem.”

Implications for fertility and beyond

For researchers, one of the most important takeaways of the new work is the similarity between human and mouse ovaries.

“Until now, it was somewhat unclear whether we could use mice as a model for humans when it comes to the ovaries — we have quite different reproductive windows,” Laird said. “But the similarities we saw in this study make us confident that we can move forward in mice and apply those lessons to humans.”

In addition, the new roadmap of healthy ovaries over time offers a starting place to ask how ovarian aging changes in different situations. Laird’s team is already launching studies probing whether some drugs could change the timing or speed of ovarian aging, she said. Ultimately, they hope to uncover ways to slow or delay ovarian aging, to impact both fertility and other diseases, like cardiovascular disease, which are common in women after menopause.

“The fountain of youth may actually be the ovary,” said Eliza Gaylord, PhD, a postdoctoral fellow at UCSF who is co-first author of the study. “Delaying ovarian aging could promote healthier aging overall.”

Authors: Other authors of the study are Mariko H. Foecke, PhD, Ryan M. Samuel, PhD, Tara I. McIntyre, PhD, Juan Du, James M. Gardner, MD, PhD, and Faranak Fattahi, PhD, of UCSF; Angela M. Detweiler, Leah C. Dorman, Michael Borja, and Ritwicq Arjyal of Chan Zuckerberg Biohub San Francisco; Bikem Soygur, PhD, of the Buck Institute for Aging; and Amy E. Laird, PhD, of Oregon Health and Science University.

Funding: This work was funded by the National Institutes of Health (1F31HD108875, 1F31HD110208, 1R01GM122902, 1R01ES023297, P30ES030284), a UCSF Discovery Fellowship, a Hillblom/BARI Graduate Student Fellowship Award, CZ Biohub Investigator funds, The Global Consortium for Reproductive Health through the Bia-Echo Foundation (GCRLE-0123), the W.M. Keck Foundation, the Simons Foundation International, the Juno Fund, and individual donors including Mary Linda Laird, Robert and Mary Laird, and Nikki J. Zapol.