Main

Deuterostomes comprise a major group of animals that share several morphological and developmental features, including radial cleavage, deuterostomy, formation of the mesoderm by enterocoely, mesoderm-derived skeletal tissues and the presence of pharyngeal openings or slits1,2,3. Although their monophyly has recently been challenged by phylogenomic studies4,5, current consensus still groups them together as one of two major bilaterian groups alongside protostomes. Recent comparative studies of gene expression dynamics suggests conservation of key developmental mechanisms throughout metazoan life cycles6; therefore, as a major bilaterian group, knowledge of the genetic controls underlying deuterostome development can potentially provide insights into the nature of both the deuterostome and bilaterian last common ancestors (LCA). Finding appropriate deuterostome models is thus key to shedding light into the genetic mechanisms underlying the diversification of body plans and life strategies in this group in relation to the deuterostome and the bilaterian LCAs.

There are three deuterostome groups: chordates, echinoderms and hemichordates. The last two form a group called Ambulacraria based on molecular data, as the sister group to chordates7,8,9. Despite morphological differences of the adult body plans, several hemichordate groups undergo indirect development with free-swimming, planktonic larvae such as those of most echinoderms10,11. The aforementioned molecular similarities are paralleled by similar patterning mechanisms, complemented with conserved expression patterns of key regulatory genes, which have been shown to control the formation of the dorsoventral (DV) axis, mesoderm, nervous system and primordial germ cells in echinoderm and hemichordate embryos and larvae12,13,14,15,16,17. In contrast to pentaradial symmetry in adult echinoderms18,19, hemichordates share bilateral symmetry with chordates, with a distinctive tripartite body plan of proboscis, collar and trunk20,21. They also show conserved structures such as pharyngeal gill slits, which is probably a synapomorphy of deuterostomes22,23, as well as a complex organization of the nervous system far from a simple nerve net24. Importantly, several hemichordate structures exhibit conserved patterning mechanisms during development. These include the overall anteroposterior (AP) axis regionalization25 and the AP patterning of the ectoderm in the proboscis and collar by signalling organizing centres similar to those of the vertebrate central nervous system26 and the formation of coelomic cavities from an endomesoderm region27,28. The patterning mechanism of the DV polarity is also similar, although the DV axis is inverted in chordates17,29. Other functionally analogous structures with uncertain homology include a collar nerve cord which follows a morphogenesis process similar to chordate neurulation30,31 but using different genes for neuronal patterning32. These similarities could potentially extend to the cis-regulatory machinery33. Whether the deuterostome ancestor was a direct or indirect developer has been controversial but, in any case, echinoderm pentaradial symmetry was probably secondarily acquired and indirect-developing hemichordates could help reconstruct the origins and the ancestor of this group and elucidate this question.

However, despite remarkable efforts in elucidating the genomic complexity and patterning mechanisms in hemichordates, a comprehensive understanding of the hemichordate developmental programs and how the genome is temporally regulated is still lacking. This also includes assessing whether a transcriptionally comparable developmental period can be recognized in deuterostomes and to what extent the developmental gene regulatory networks (GRNs) are conserved among different deuterostome groups34. While there have been efforts to begin the functional characterization of cis-regulatory elements in hemichordates such as Saccoglossus kowalevskii33,35, the current lack of cis-regulatory information for indirect-developing hemichordates, as well as cis-regulatory data spanning the whole genome, precludes further comparisons to assess the degree of conservation among different deuterostome lineages.

To address these questions, we characterized developmental transcriptomes and chromatin accessibility of Ptychodera flava, an indirect-developing hemichordate, based on its new chromosomal-level genome assembly36. We discovered that P. flava exhibits a biphasic developmental control with two major transcriptional programs which are regulated by different classes of transcription factors (TFs). Comparative developmental transcriptomics and GRN analyses suggest that gastrulation represents the stage of highest molecular resemblance of deuterostomes. Additionally, we validated our data exploring the cis-regulatory activities of open chromatin regions in the vicinity of the gene encoding FoxA, one of the main TFs with conserved gene expression across similar deuterostome stages and also a driver for the gastrula cis-regulatory landscape based on our analyses. Overall, our results suggest that gastrulation is controlled by conserved gene regulatory circuits in deuterostomes and align with recent efforts in characterizing the dynamics of gene expression across metazoan development and life cycles, providing a global view for understanding the conservation of the molecular basis underlying deuterostome development.

Results

Two transcriptional programs during P. flava development

Indirect-developing hemichordates go through several morphologically distinct developmental stages. After two days of embryogenesis, late gastrulae of P. flava hatch and undergo a series of tornaria larval stages that stay in the water column for months before metamorphosing into a worm-like body plan (Fig. 1a). We analysed gene expression profiles using RNA-seq at 16 stages, covering the entire P. flava life cycle, to understand the dynamics of gene expression during its development (Fig. 1a,b and Supplementary Fig. 1). Overall, we identified 28,413 genes expressed across all stages (Supplementary Data 1). Similarities at the whole-transcriptome level suggest two independent programs of gene expression at early and late development, with late gastrula (LG) and the Muller stage (the newly hatching tornaria) as the transitional point (Fig. 1c and Supplementary Fig. 1g,h). Differential gene expression analysis (DGEA) throughout development identified 23,726 dynamically expressed genes (hereafter referred to as temporal genes), which represent 83% of all genes detected and can be grouped into 22 clusters on the basis of their temporal gene expression profiles (Fig. 1d, Supplementary Fig. 1I and Supplementary Data 1). Several of these gene clusters clearly correspond to their deployment at different developmental phases (for example, clusters 1 and 2 match to maternal factors used in early embryogenesis; clusters 6–8 are activated during embryogenesis; clusters 15–19 are generally deployed during the larval phase; clusters 20–22 are predominantly used during metamorphosis and in adults). We noted that genes expressed at the Muller stage (Fig. 1d) include clusters deployed mainly during gastrulation (clusters 12 and 13) and later larval stages (clusters 15 and 16). This observation reinforces that Muller stage represents a turning point between embryonic and larval transcriptional states, since its transcriptomic profile is essentially using embryonic-specific genes but starting to shift to a ‘larval-like’ transcriptomic program. Gene ontology (GO) enrichment analysis of these clusters revealed major developmental processes taking place at the time of peak expression (Supplementary Data 2). For example, we found GO terms related to RNA processing, transcriptional activation and mitosis during early embryogenesis (for example, clusters 2 and 5–8), organ development during gastrulation and early larval periods (cluster 10), as well as cilia and axon formation at gastrula and larval stages (clusters 13–16) (Supplementary Fig. 2 and Supplementary Data 3). These results corroborate our understanding of the temporal controls of transcriptional programs orchestrating different developmental processes and are in line with our current understanding of metazoan developmental transcriptomics6.

Fig. 1: Developmental transcriptomics of P. flava.
figure 1

a, Life cycle of P. flava and sample collection. Grey dashed line represents the time of hatching and inner coloured lines represent the duration of the embryonic and larval transcriptome. Illustrations adapted from refs. 9,22,111. b, PCA analysis of the developmental transcriptomes. The text colours correspond to the colour illustrations of different stages in a. c, Pairwise Spearman correlation of transcriptomes of each stage (quantile normalized, averaged replicates) revealed two major transcriptional programs (phases) during development: embryonic and larval development. d, The 22 clusters of dynamically expressed genes among the 16 stages. Average gene expression of all the genes in each cluster. The numerical order of the clusters is based on timing of maximal peak expression throughout development. e,f, The two phases are orchestrated by distinct sets of TF classes. e, Heatmap showing Chi-squared residuals of over- or under-representation of TF classes throughout development. Relative expression (transcripts per million per gene) of each major expressed class during development. f, Co-expression of TFs and temporal genes. For a given TF class X and cluster Y, fraction of X genes with a one-tailed Spearman correlation >0.6 with the average expression profile of genes in cluster Y.

To understand how the transcriptional programs are controlled, we focused on the usage of TFs during P. flava development (Fig. 1e and Extended Data Fig. 1). For an approach to identify the underlying gene expression regulatory machinery, we classified the TFs into major classes (following animalTFDB and automated orthology relationships with vertebrates) and looked at their expression throughout development (Supplementary Data 1 and 4). Chi-squared analysis of TF class proportions in each developmental stage suggests a gradual but biphasic program of TFs, with over-representation of E2F, HMG (including Sox), homoeodomain Pou and T-box TFs that are maternally deposited and used at early stages and later classes such as bZIP, bHLH, DM and RXR (Fig. 1e); some classes such as homoeodomain and bHLH showed a more gradual incorporation to late development. The timing of these TFs generally corresponds to their roles in development in other species. For example, the Pou and T-box (including Brachyury) classes, mediate pluripotency and cell fate specification in early stages and gastrulation37,38,39,40; we also observed TFs such as one steroidogenic factor with an increase in gene expression in larval stages, which suggests steroidogenic signalling and gonad maturation function at later stages41. Additionally, several TF classes exhibit high agreement in co-expression with temporal gene clusters throughout development, supporting that the two distinct developmental programs of embryonic and larval development are controlled by different TF combinations (Fig. 1f). Notably, RXR (class 22) and thyroid hormone (THR) (class 27) pathways are involved in metamorphosis in a variety of animals42,43 and they are co-expressed with clusters 20 and 21, genes predominantly expressed during P. flava metamorphosis (red asterisks in Fig. 1f). This observation suggests a similar role of RXR and THR functions during P. flava metamorphosis. In sum, these results indicate that the two major developmental programs of P. flava are controlled by two sets of TFs to perform temporal processes known in metazoan development, including other deuterostomes.

Gastrulation is the most similar stage in deuterostomes

Next, we sought to understand the transcriptional dynamics of P. flava in the context of deuterostome evolution. For this, we compared the transcriptome of P. flava against those of four other deuterostomes, plus one protostome as an outgroup (Fig. 2; see Methods and Extended Data Fig. 1 for a graphical summary; Extended Data Figs. 28, Supplementary Figs. 3 and 4, Supplementary Note 1 and Supplementary Data 58). Comparing the transcriptional dynamics of P. flava against other deuterostome species revealed similarities at mid and mostly late development. The most similar stages between hemichordate and echinoderms (the sea urchin Strongylocentrotus purpuratus and the crinoid Anneissia japonica) were larval stages (Fig. 2a), with genes enriched in signal transduction and metabolism (Extended Data Figs. 2 and 3). This suggests that the LCA of Ambulacraria possessed larval stages. We noted that the hemichordate early gastrula transcriptome bears high resemblance to those of the sea urchin mesenchyme blastula and early gastrula stages; the former stage represents the initial gastrulation process in sea urchins when the primary mesenchyme cells ingress into the blastocoel44. Genes in common between these stages were enriched in neurogenesis and germ layer fate commitment (Extended Data Figs. 2b,c and 3b,c). Comparison with cephalochordates (Branchiostoma lanceolatum and Branchiostoma floridae) showed that hemichordate mid development has higher resemblance to cephalochordate neurulation than to mid or late developmental stages of echinoderms, when comparing the three lineages using a set of 3,192 common orthologues predicted to have been present in the deuterostome LCA (Fig. 2a,b, Extended Data Fig. 6 and Supplementary Data 7). This suggests that the set of predicted deuterostome LCA orthologues might have retained their gastrulation transcriptional dynamics in hemichordates and cephalochordates. Genes expressed in common between hemichordate and cephalochordate mid development were enriched in neuron migration and cell fate commitment but also AP patterning (Extended Data Figs. 4 and 5), probably reflecting the relative timing of comparable developmental events such as neurogenesis and mesoderm/endoderm separation. Hemichordate and cephalochordate late development stages also showed similarities and were enriched in genes related to metabolism and signal transduction (Extended Data Figs. 4 and 5).

Fig. 2: Comparative transcriptomics and gene age analysis.
figure 2

a,b, Transcriptional dynamics of deuterostomes, using a common set of 3,192 orthologues found in the three species. a, Mean Jensen–Shannon distance (JSD) (bootstrap 1,000 iterations) of pairwise comparisons between P. flava and sea urchin S. purpuratus developmental transcriptomes. Illustrations adapted from refs. 9,49. b, Same analysis for P. flava and amphioxus B. floridae developmental transcriptomes. c, Expression of TFs during ambulacraria gastrulation and cephalochordate neurulation. Gene symbols in black indicate that gene names come from in situ hybridization orthology; gene symbols in grey indicate eggNOG-predicted name. d, Jensen–Shannon distance (bootstrap 1,000 iterations) of pairwise comparisons between P. flava and O. fusiformis developmental transcriptomes using 1:1 orthologues. Illustrations adapted from refs, 6,9. e, TAI throughout the development of P. flava. f, Gene age enrichment of temporal clusters (two-tailed Chi-squared test). Of, O. fusiformis; Pf, P. flava.

To better understand the underlying molecular mechanisms driving these similarities in gene expression, we looked at the dynamics of TFs in these species. We noticed a trend of similar expression of Pou, HMG and T-box classes in early and mid development and bZip and Homeobox in late development (Supplementary Fig. 4a,b), roughly matching similar observations outside deuterostomes45. Transcriptional dynamics of pairwise orthologous TFs also showed similarities at late but also early and mid development (Supplementary Fig. 4c,d). Using a set of 234 orthologous TFs predicted to have been in the deuterostome LCA, we detected groups of TFs with distinct patterns of temporal dynamics across deuterostomes (Supplementary Fig. 4e–g, Extended Data Fig. 7 and Supplementary Data 8). The vast majority of these TFs retain their relative timing of expression during development and we noticed a group of TFs with high expression in ambulacraria gastrulation and cephalochordate neurula stages, including foxa, nkx2.1, nkx2.2, brachyury and foxg (Fig. 2c and Extended Data Figs. 7 and 8); some have been deployed differently in different deuterostomes (such as otp in echinoderms). The observation that amphioxus neurula stages are transcriptomically similar to hemichordate gastrula probably reflects the exact timings of comparable developmental events; for example, neurogenesis and separation of mesoderm and endoderm occur at the mid gastrula in P. flava and during neurulation in amphioxus46,47. Transcriptomic similarity between hemichordate gastrula and amphioxus neurula may be attributed to core sets of TFs regulating the conserved developmental events in the two lineages.

Comparing P. flava against the annelid Owenia fusiformis as an outgroup of deuterostomes revealed similarities between hemichordate gastrula stages and the annelid gastrula, elongation and early larva stages (Fig. 2d and Extended Data Fig. 8a), with genes related to skeletal morphogenesis, neuron fate commitment and AP patterning (Extended Data Fig. 8b,c and Supplementary Data 6). Hemichordate larval stages also showed similarity to O. fusiformis mitraria and competent stages, with similar patterns of signal transduction and metabolism (Extended Data Fig. 8d–f). We also observed the maximal similarity of orthologous TF expression around gastrulation and elongation (Extended Data Fig. 8g). The maximal similarity between a deuterostome and a protostome being in gastrulation aligns with our observations that the transcriptomic age indexes (TAI) of gastrula stages in P. flava are relatively older than the other stages (Fig. 2e), supported also by the observation that genes deployed during embryogenesis were enriched in more ancient genes (Fig. 2f). Overall, these results suggest that gastrulation is the stage with the highest degree of molecular resemblance of deuterostomes and this conservation may have predated deuterostome origins.

Cis-regulation analysis during P. flava gastrulation

Owing to the high degree of gene expression resemblance we observed during deuterostome gastrulation, we focused our study on cis-regulation of gene expression in this period of P. flava development. We identified open chromatin regions (OCRs) in the P. flava genome using assay for transposase-accessible chromatin sequencing (ATAC-seq)48 at four time points: early blastula (EB), early, mid and late gastrula (LG) (Fig. 1a, indicated with the red dots). We identified 82,807 OCRs throughout gastrulation, mostly located in ±20 kilobase (kb) vicinity of genes, predominantly within intergenic regions or introns, followed by promoters (Extended Data Fig. 9a,b and Supplementary Data 9 and 10). Binding motifs of trans-regulators such as developmental TFs appear more frequently outside promoters and in distal elements (Supplementary Fig. 5). Nearly 70% of the annotated genes have at least one OCR associated, with ~50% associated with two or more OCRs (Extended Data Fig. 9c). We observed a trend of OCR accessibility in accordance with expression levels of nearby genes (Extended Data Fig. 9d), suggesting that cis-regulation generally occurs as transcriptional activation during P. flava gastrulation. Genes annotated as trans-developmental (with GO terms in transcription or development, as similarly defined in ref. 49; Methods) tend to be associated with cis-regulatory elements (CREs), with more OCRs per gene than housekeeping genes (Extended Data Fig. 9e,f). This is in agreement with previously observed trends towards complexity of cis-regulatory architecture in genes orchestrating metazoan development49,50,51.

We identified 17,488 dynamically accessible OCRs throughout gastrulation, which were categorized into six clusters (Fig. 3a). These six clusters were enriched with specific DNA-binding motifs of major TF classes that matched those controlling gene expression during development in other metazoan lineages (Fig. 3b, Extended Data Fig. 9g and Supplementary Fig. 6). While there is overlap between blastula and gastrula OCRs, a substantial amount of OCRs are found specifically in blastula or during gastrulation. Furthermore, these stage-specific OCRs show enrichment for different motifs. For example, motifs of T-box TFs (including BRACHYURY, EOMES, TBR and TBX) are enriched in the OCRs at the EB and early gastrula stages. On the other hand, Sox-binding motifs are enriched in OCRs after gastrulation occurs. This dynamic is consistent with the transcriptome data, in which T-box and Sox factors are highly expressed during embryogenesis, with T-box predominantly present before gastrulation and Sox more abundant during gastrulation (Fig. 1e). Other examples include blastula binding motifs of E2F and the p53/63/73 family which are related to cell division (the latter being a known germ layer regulator52,53,54) and gastrula motifs of Dlx, Fox, Gsc and Pax which are known to be expressed in different germ layers during P. flava gastrulation (Extended Data Fig. 9)12,17,55,56. These results suggest that the highly expressed TF genes are functionally active in regulating target genes (TGs) via the identified OCRs. We further crossed these data with early temporal genes and also found binding motifs for early TFs, such as T-box factors, E2F and p53/63/73 (Supplementary Fig. 7), aligning with the early factors found on the basis of direct clustering of the ATAC-seq data (Fig. 3b). This concordance between the binding sites of temporal OCRs and the binding sites in the proximity of temporal genes suggests that potential regulatory features of temporal genes can be inferred on the basis of their surrounding cis-regulatory landscape.

Fig. 3: Cis-regulation of P. flava during gastrulation.
figure 3

a, Clustering of dynamically accessible OCRs over time. b, Motif enrichment analysis of each cluster of OCRs in a. Dot plot for HOMER logqvalue (colour) and fraction of peaks with motif (dot size). c, Schematic of the ANANSE analysis for networks of cis-regulatory interactions. Illustrations adapted from refs. 9,22. d, Scatter plot of TF centrality in each of the networks. x axis, centrality in EB network; y axis, centrality in LG network. e, Distribution of centrality in network for all the TF classes in the EB network (up) and the LG network (down). Centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile ranges; points, outliers. f, Scatter plot of main factors driving the transition as calculated by ANANSE influence. x axis, log fold change of gene expression; y axis, ANANSE influence score. Top 20 TFs are labelled with their inferred gene name (from in situ hybridization data (Supplementary Data 13) or eggNOG-predicted gene name). Chrom accs, chromatin accessibility; EG, early gastrula ; MG, mid gastrula.

Inferred regulatory networks during P. flava gastrulation

Developmental processes are driven by complex interactions between many genes constituting developmental GRNs. To assess putative cis-regulatory interactions in P. flava, we used analysis algorithm for networks specified by enhancers (ANANSE)57, a computational tool that uses cis-regulatory and gene expression data to generate GRN graphs and predict TF–TG interactions, focusing on activating TFs. These graphs are composed of nodes (the TFs and TGs) and edges (the interactions between TFs and TGs, whose score values are calculated using an additive model of gene expression, OCR accessibility, distance between OCR and its corresponding transcription start site (TSS) and motif enrichments within the OCR (Fig. 3c). We generated two independent graphs, one of EB and one of LG (Figs. 1a and 3c and Supplementary Data 11 and 12). The number of genes represented in both graphs is similar, both in number of TFs, number of effector genes and the relative connectivity of genes (Extended Data Fig. 9h–k). These graphs however showed structural differences as reflected by changes in the centrality of TFs (how much a given gene is embedded within each GRN, related to the number of connections to other genes)58 (Fig. 3d,e), in connections per TG, in number of self-regulating TFs and a net gain of trans-developmental centrality in LG (Extended Data Fig. 9l,m). Overall, TF classes showing higher centrality corresponded to the dominantly expressed TFs at the respective stage (Figs. 1e,f and 3e). For example, E2F and Pou classes show high centrality in the EB network and bZIP, ETS, GATA, Homeobox and Forkhead factors show higher centrality in the LG network. This suggests a different set of core TFs constituting GRNs at each stage. Common TFs found in the two networks (64% of all TFs in either graph) are enriched in terms of regulation of gene expression, which is reassuring of our analysis, with overall overlapping GO terms and functional categories in the two stages but higher enrichments for cell migration and organogenesis in LG-specific TFs (Supplementary Fig. 8a–e). The genes that are being controlled in the two networks also show differences. Out of their top TGs (defined as top % highest in degree), they share 615 (Supplementary Fig. 8f), with top targets exclusive of EB enriched in chromatin regulation (Supplementary Fig. 8g) and top targets of LG enriched in developmental and morphogenetic processes (Supplementary Fig. 8I). Interestingly, more than half of the gastrula-specific TGs were detected as postgastrulation temporal genes in our clustering analysis compared to EB-specific TGs (Fig. 1d and Extended Data Fig. 9n). These observations suggest that, in comparison to the EB network, the LG network is orchestrated by a different core of TFs controlling genes related to organogenesis and signalling, including genes that are poised to participate in larval development.

By the time of LG, the three germ layers have segregated into morphologically distinct territories in P. flava. On the basis of the known germ layer-specific expressing genes at the LG stage of P. flava (Supplementary Data 13), we subdivided the LG network into circuits that potentially mediate germ layer development (Extended Data Fig. 10a–c). Within those circuits, we noticed that TF genes showing high centrality tend to express broadly, while low centrality genes often display restricted expression patterns in the respective germ layer (Extended Data Fig. 10d,e). For example, xbp1 (ref. 13), soxb2 and otx are the most connected genes within the ectoderm circuit and are known to express in broad ectodermal domains at the LG stage. Similarly, genes show high centrality in the mesoderm network (for example, foxg and foxf) and endoderm circuit (for example, soxb1a and foxa) are expressed broadly in the mesoderm-derived protocoel and endoderm-derived archenteron of the LG embryos, respectively. Genes that are known to express in restricted regions in the corresponding germ layer, such as gsx (posterior ectoderm59), foxq2 (anterior ectoderm12), msx (dorsal mesoderm13) and tsg (anterior endoderm13), show low centrality in the respective circuit (Extended Data Fig. 10a–e). Other examples of TF genes with low centrality displaying restricted germ layer-specific expression patterns are shown in Extended Data Fig. 10e. The agreement between spatial expression patterns of TF genes and their hierarchical positions in GRNs coincides with the expected roles of TFs in specifying germ layers and the subdomains constituting the specific germ layer, with highly central TFs enriched in morphogenetic and developmental terms compared to low centrality TFs (Extended Data Fig. 10f). This suggests that this metric can help identify potential downstream TFs. We finally used ANANSE influence57 to detect factors responsible for the transition between the network states of EB and LG and found a number of well-known developmental TFs with known roles in germ layer-specific fate induction (Fig. 3f, Extended Data Fig. 10g and Supplementary Data 14). Altogether, the inferred networks based on RNA-seq and ATAC-seq data can potentially serve as useful models for understanding network evolution through cross-species comparison and for future functional validations.

Conservation of cis-regulatory mechanisms in deuterostomes

We reasoned that the transcriptional similarities we observed across species, including the expression dynamics during hemichordate gastrulation, might be attributed to conserved cis-regulatory interactions. We aimed to validate the conservation of deuterostome cis-regulatory interactions of foxa due to the following reasons. We observed that foxa is among the set of TFs with conserved expression during gastrulation and/or neurulation (Fig. 2c), appearing as a contributor to the transcriptomic similarities of deuterostome gastrula/neurula stages (Fig. 4a–d) and with high expression during mid development in the three deuterostome lineages (Fig. 4e). Foxa is also one of the most central TFs in the hemichordate LG network (Fig. 4f) and it is found among the top influential factors in the transition to the LG cis-regulatory landscape according to our ANANSE analysis (Fig. 4g). In addition, in the hemichordate gastrula ANANSE network, we observed putative cis-regulatory interactions between foxa and other orthologous genes that constitute two deuterostome regulatory kernels, the echinoderm endomesoderm kernel and the chordate axial mesoderm kernel60,61,62 (Fig. 4h,i and Supplementary Data 15). To assess the degree of conservation of the cis-regulatory information of foxa, we generated reporter constructs covering one to four potential CREs which we detected in our ATAC-seq data upstream of the P. flava foxa gene (Fig. 4j), the expression of which is initiated at the EB stage in the presumptive endomesoderm17,56. We used sea urchin embryos as a reporter assay system and observed that P. flava foxa CREs 2–4 are able to drive specific expression patterns, with CRE4 being sufficient to drive reporter expression in ectoderm and endoderm, which recapitulates the endogenous expression of sea urchin foxa56,63 (Fig. 4k–m and Supplementary Data 16). On the other hand, CRE1 is probably a general repressor, as the reporter construct CRE1–4 (containing CRE1 and active CRE2-4) lost transcriptional activity (Supplementary Data 16). The repressive role of CRE1 is consistent with the decrease of its openness after the EB stage (Fig. 4j), when the gene starts to express zygotically (Fig. 4e). These results validate the biological activities of the OCRs revealed by ATAC-seq and suggest similar regulatory inputs to the foxa gene during sea urchin and hemichordate gastrulation. This suggests that the conservation of the endomesoderm kernel could be pushed further to the common ancestor of Ambulacraria, reinforcing the notion that part of this kernel already existed in the bilaterian ancestor64. Overall, these results suggest that the transcriptomic similarities during deuterostome gastrulation could be the product of conserved cis-regulatory interactions that can be dated back to the common ancestor of ambulacrarians or deuterostomes.

Fig. 4: Conservation of cis-regulatory mechanisms in deuterostomes.
figure 4

a, Transcriptomic distance of P. flava and sea urchin using pairwise orthologous TFs (same panel as Supplementary Fig. 4c, reproduced here for clarity). b, Transcriptomic distance of P. flava and amphioxus using pairwise orthologous TFs (same panel as Supplementary Fig. 4d, reproduced here for clarity). c, Scatter plot of gene expression (log1p counts) between orthologous TFs in hemichordate LG and sea urchin mesenchyme blastula (MeBl). Two-sided linear model fitting. d, Scatter plot of gene expression (log1p counts) between orthologous TFs in hemichordate LG and amphioxus neurulation. Two-sided linear model fitting. e, Expression of foxa in P. flava, sea urchin and amphioxus. f, Scatter plot of TF centrality in the hemichordate cis-regulatory networks we constructed. g, Scatter plot showing top influential TFs in the transition from hemichordate EB to LG, with foxa highlighted among top influential. h, Predicted putative cis-regulatory interactions of the P. flava orthologues of the genes that form the endomesoderm kernel in echinoderms. i, Predicted putative cis-regulatory interactions of the P. flava orthologues of the genes that form the axial mesoderm kernel in chordates. For h and i, Pexp indicates expected probability of finding in the LG graph as many TFs and effector genes in the same proportion, randomly connected. j, Upper panel, ATAC-seq data showing four putative CREs (shaded in blue) upstream of the P. flava foxa gene at four different developmental stages. Lower panel, illustrations of the four reporter constructs assayed in sea urchin embryos. k, Percentages of EGFP expression domains driven by the four reporters at the mesenchyme blastula and gastrula stages. Percentage was calculated as the number of embryos with at least one positive cell, relative to the total showing GFP fluorescence. Bars represent average of percentages (n = 3). Error bars represent standard deviation. l, Schematic representation of the foxa gene in sea urchin and P. flava. Green shade indicates expression. m, EGFP expression driven by the indicated reporter constructs at the mesenchyme blastula and gastrula stages of sea urchin. All embryos are shown in the same scale (scale bar, 100 μm). Similar results were found in at least three independent experiments. Unless otherwise indicated, the gastrulae are viewed from the left side with ventral to the left. DV, dorsal view. Sp, S. purpuratus.

Discussion

In this study, we generated an extensive catalogue of developmental transcriptomics of the indirect-developing hemichordate P. flava. The developmental processes of P. flava is underlied by transcriptional programs exhibiting a biphasic behaviour, with LG and the Muller tornaria larva as the turning point (Fig. 1). The two major transcriptional programs are controlled by distinct groups of TFs as suggested by our co-expression analysis with temporal genes. Comparative transcriptomics with other deuterostome species and analyses of transcriptome ages suggest that gastrulation represents the stage of highest molecular resemblance within deuterostomes (Fig. 2).

Overall, our comparisons suggest that deuterostome gastrulation and larva stages are underlain by gene expression dynamics that predate the origin of this group, building on previous knowledge from recent and contemporary efforts6,34. Similar transcriptional dynamics between ambulacrarian larval stages indicates that similar transcriptional programs are deployed for constructing the larval forms, supporting a common indirect-developing ambulacrarian ancestor. Similarities between hemichordate gastrulation and cephalochordate neurulation but also annelid elongation stages (Fig. 2b,c; also shown in ref. 6) suggest that the underlying mechanisms around gastrulation predate the origin of deuterostomes. Similarities in ambulacrarian gastrulation seem to be mostly related to cell fate commitment, especially mesoderm and neurogenesis, suggesting that these mechanisms might have not changed during ambulacrarian evolution. Overall, the potential regulators orchestrating these transcriptional programs (the TFs) also appear to be largely conserved in this group, as shown by the early and late deployment of HMG and bZIP factors among deuterostome species as well as numerous orthologous TFs which share their temporal expression profile. Taken together with similarities to protostome late development, differences in TF expression timing (during or after gastrulation) between ambulacraria and cephalochordates might indicate that several processes taking place during postgastrulation stages in chordates and protostomes could have been deployed at an earlier time in development in ambulacraria. This aligns with our current knowledge that some developmental mechanisms occur after gastrulation in chordates, such as neurogenesis and the segregation of mesoderm and endoderm46,47

Overall, our findings suggest that deuterostome gastrulation features conserved mechanisms that date back to the common ancestor of Bilateria and are not a deuterostome-exclusive innovation. Our findings also suggest that deuterostome transcriptional programs are largely similar in composition despite the morphological disparity and differences in their life cycles, with some temporal adjustments probably accounting for the differences in the timing of developmental mechanisms between stages and for the differences in their lifestyles. A more parsimonious interpretation, as interphyla comparison can sometimes be difficult due to the taxonomical definition of phyla and the taxon sampling of choice65,66, is that only mid embryogenesis and larval stages appear more conserved when comparing distantly related phyla due to differences in their lifestyles because of the fundamental mechanisms of development at play during those stages such as epithelial–mesenchymal transition, AP patterning and germ layer specification. More comparative data on hemichordates and other deuterostomes at the tissue or cell-type levels, are required to explore the corresponding cell types that contribute to the overall transcriptomic similarities we observed. These could potentially hint at the degree of conservation of gene programs and the conservation of homologous mechanisms between cell types, germ layers and other structures throughout development across species.

Our comparative observations align with our cis-regulatory analyses. Our network analyses suggest the switch to the gastrulation cis-regulatory program consists of rewiring and addition of TFs to the GRN already at play during blastula, which translate into changes in TF/trans-developmental centrality and an increase in TF self-regulation (Fig. 3). This feedback property is expected to stabilize distinct regulatory states for germ layer specification. The effector genes also reflect the underlying processes at play in each stage, with an embedding of morphogenetic and postgastrulation effector genes in gastrula. Additionally, TF genes exhibiting higher centrality correspond to their broader spatial expression patterns, consistent with their hierarchical roles within the individual germ layer network. These results also reflect our previous observations on transcriptomic similarity across species, with a set of core TFs that have retained their timing of expression in deuterostomes detected as influential in the transition to the gastrulation cis-regulatory landscape. This suggests network inference based on RNA-seq and ATAC-seq data can serve as useful models for understanding network evolution through cross-species comparison and for future functional validations. Together, our observations suggest that the deuterostome LCA had a similar temporal program of TF genes which probably constituted hardwired GRNs, which aligns with findings about conserved GRN mechanisms despite genome rearrangements in echinoderms34. One putative example could be the endomesoderm kernel, for which we predicted potential cis-regulatory interactions. The ability of CREs near the hemichordate foxa gene to drive gene expression in sea urchin embryos suggests conserved regulatory inputs in the two lineages, reinforcing the idea that the endomesoderm kernel emerged before the onset of Ambulacraria. Similarly, the prediction of potential cis-regulatory interactions between genes of the chordate axial mesoderm kernel in P. flava suggests that these interactions might predate the origin of chordates and were present at least in the LCA of deuterostomes, although further functional studies are required to test this hypothesis. Of note is to mention that, since hemichordates lack a homologous structure to the notochord9,67, new questions arise as to which structure might deploy potential cis-regulatory interactions such as those of the axial mesoderm kernel in hemichordates. Potential candidates could be the endodermal region and the mesoderm-derived protocoel, since some of the orthologous genes within the network are known to express in these embryonic territories12,15,17. More validation using in situ hybridization is needed to corroborate the co-expression of genes belonging to this GRN. Likewise, we acknowledge the limitation of our methods as these analyses are based on leveraging gene expression and chromatin accessibility to reconstruct TF–TG interactions and cannot reconstruct potential protein–protein interactions (such as those between ligands and receptors). Likewise, although TF binding motif association is based on sequence homology-informed automated transfer similar to standard practices68, it is done in a lenient fashion, with multiple motifs assigned per TF and the highest scoring one taken into account on a TF/TG case-by-case basis, which may hamper true 1:1 associations such as those derived from ChIP–seq experiments69. Nevertheless, the transcriptomic analysis and the inferred networks provide a global view and working models for future studies. Overall, these comparative analyses depict a more complete picture of the conserved cis-regulatory interactions during gastrulation of deuterostome LCA.

In summary, this study presents a high-throughput, genome-wide, multimodal catalogue of quantitative regulatory genomics in hemichordates. In doing so, it establishes the basis for comparative studies in deuterostome lineages with a global characterization of the embryonic and larval development of an indirect-developing hemichordate. All together our data effectively generate hypotheses for exploring the similarities and divergences of the molecular mechanisms underlying deuterostome and metazoan development.

Methods

Sample collection and animal husbandry

P. flava adults were collected from Penghu Islands, Taiwan. Methods for spawning induction and embryo culture were described previously46. For RNA-seq, total RNA of each developmental stage was isolated from one to three biological replicates using the RNeasy Micro Kit (Qiagen). Library preparation for stranded polyA-based messenger RNA was then performed using the TruSeq Stranded mRNA Library Prep Kit (Illumina). The multiplexed complementary DNA libraries were subjected to sequencing with 151 base pair paired-end read length on a HiSeq 2500 sequencer (Illumina). The average sequencing depth was 24,861,133 paired reads per sample. For ATAC-seq, embryos at different stages were collected to reach 75,000 cells per sample (~700 early blastulae, ~150 early gastrulae, ~100 mid gastrulae and ~50 late gastrulae). The collected embryos were treated with 0.1% N-acetylcysteine for 10 s to remove the jelly coat, followed by incubating with 8.5 mM of dithiothreitol in seawater containing 5.7 mM of NaOH. The cells were then lysed for the transposition reaction and PCR amplification and the purified libraries were sequenced as described49.

In situ hybridizations

In situ hybridizations were performed using embryos of P. flava at the LG stage, as previously described in refs. 59,70,71.

EGFP quantification

Owing to mosaic incorporation of the construct, which commonly happens, we counted an embryo as positive if it showed at least one cell or region of the embryo with fluorescence signal. For each germ layer and CRE construct, the percentage of EGFP embryos was calculated as the ratio of positive embryos for a given germ layer divided by the total number of EGFP embryos observed. EGFP expression of CRE1–4 was inconsistent and yielded only two positive embryos, which we considered inconclusive.

Gene annotation

Generation of protein sequences

We ran TransDecoder (https://bio.tools/TransDecoder) using the set of longest isoforms per gene of P. flava, querying against the SWISSPROT and Pfam databases to gather evidence at the sequence homology and protein domain level. This evidence was subsequently incorporated into TransDecoder and we retrieved the longest predicted protein sequences with the parameter --single-best-only.

Gene ontology annotation

Using the TransDecoder-predicted set of protein sequences, GO terms were generated using a combination of BLAST2GO and eggNOG annotation software, using RefSeq non-redundant proteins database72 and eggNOG metazoa73 as reference databases, respectively.

Gene ontology enrichment analysis

For every cluster, GO enrichment analysis was performed using the topGO R package74,75 using the whole gene set of P. flava as gene universe and the elim method74.

Trans-developmental and housekeeping classification

A gene was classified as trans-developmental if annotated with any of the following GO terms: GO:0044430, GO:0006412, GO:0003735, GO:0005840, GO:0005842, GO:0030529, GO:0006414, GO:0009119, GO:0009062, GO:0044262, GO:0006725; but none of the GO terms for housekeeping genes. A gene was classified as housekeeping if annotated with any of the following GO terms: GO:0009790, GO:0030154, GO:0043565, GO:0007267, GO:0008380, GO:0003700, GO:0140297, GO:0010467, GO:0010468, GO:0016070; but none of the GO terms for trans-developmental genes.

Transcription factor annotation

We first generated groups of orthology using P. flava, purple sea urchin, Japanese feather star, amphioxus and well-annotated vertebrate species (human and zebrafish) using OrthoFinder76,77 (Supplementary Data 5). Using human and zebrafish queries for TFDB Metazoa78, we transferred TF class annotation to their respective orthogroup and thus to every gene of the other species. Second, we also added to this classification all genes identified as a TF homologue by the GimmeMotifs tool motif2factors (briefly, transferring TF annotation based on orthology evidence using OrthoFinder against a standard set of model species; see gimmemotifs documentation), which was required for ANANSE (see later). We selected ‘top’ TF classes as having high coefficient of variation in the expression datasets of P. flava, S. purpuratus and B. lanceolatum. The resulting TF annotation was used for downstream TF analysis.

Time-series RNA-seq analysis

To quantify gene expression, reads were pseudo aligned against the set of longest transcripts per isoform per gene using kallisto79. RNA-seq counts were corrected for batch effect and quantile normalized using edgeR80,81,82 and RUVSeq83 taking the 10,000 bottom-ranked genes in edgeR as in silico empirical, invariable genes. Pairwise Spearman correlation of stages and replicates was done, clustered and visualized in R84 using base and ComplexHeatmap85 packages. We also clustered the developmental transcriptomes in an unsupervised manner using Pearson correlation with bootstrapping (co-occurrency as described in refs. 86,87). DGEA was performed using DESeq2 (ref. 88) using likelihood ratio test (LRT) and RUV-correction as reduced formula. Genes with an adjusted P < 0.05 (negative binomial LRT) were taken as differentially expressed. For every replicate, variance-stabilized counts were averaged to generate a time-series dataset of every differentially expressed gene throughout development. For every gene, expression data were scaled on a z-score (standard deviation over mean). Temporal clusters were then generated using hierarchical clustering on the scaled dataset, with the following parameters: method = “ward.D2”. Data visualization was done using R base (https://www.r-project.org/), ggplot2 (ref. 89), RColorBrewer90, colorspace91, pheatmap (RRID:SCR_016418) and ComplexHeatmap85 packages under R v.4.0.3.

Transcription factor analysis

For a given TF class X in a given species (P. flava, B. lanceolatum and S. purpuratus) at a given developmental stage Y, we computed (1) the sum of genes of TF class X expressed (>1 transcripts per million) at stage Y; and (2) the sum of transcripts per million (using kallisto92, see earlier) for every expressed gene of class X at stage Y. We used the calculations above to perform a Chi-squared test of abundance of every TF class in every developmental stage of P. flava. To measure correlation between classes and temporal gene expression, we first computed the average expression profile (z-scored) of every temporal cluster. Then, for a given TF class X, we computed the number of TFs of class X with a similar expression profile (z-scored) as that of the stage-specific cluster (with a Spearman correlation >0.6), divided by the number of TFs in class X. We did this for every TF class with every temporal cluster. For motif discovery analysis, we first extracted the OCRs (differential or not) associated with genes of each temporal cluster of gene expression (prelarval development). Then, we performed motif enrichment analysis using the program FindMotifsGenome.pl from the Homer tool suite93.

Re-analysis of other species

Re-analysis of sea urchin

We retrieved RNA-seq reads of S. purpuratus from ref. 94 and mapped them against the Spur5.0 (ref. 95) set of longest isoforms per gene using kallisto92. Read counts were imported to R using the tximport package. Counts were transformed to account for batch effect using edgeR and RUVSeq as explained above and we used edgeR to identify the top variable genes in lieu of performing DGEA because of the lack of replicates in the original dataset (the only sample with two replicates, mesenchyme blastula, was kept as separate replicates for all the analyses). We clustered these highly variable genes in R using hierarchical clustering. We also annotated the TF repertoire of sea urchin and measured TF class composition of each transcriptome as well as measured correlation between classes and temporal gene expression, as explained above for P. flava. Code and full documentation is available at the online repository of the project.

Re-analysis of B. lanceolatum

RNA-seq reads of B. lanceolatum from ref. 49 were mapped against the set of longest isoforms per gene of the currently available transcriptome96 using kallisto92. Read counts were imported to R using the tximport package. We used DESeq2 to identify temporally expressed genes using the negative binomial likelihood ratio test as explained above. We clustered the resulting temporally expressed genes in R using hierarchical clustering to retrieve clusters of temporal genes. We also annotated the TF repertoire of B. lanceolatum and measured TF class composition of each transcriptome as well as measured correlation between classes and temporal gene expression, as explained above for P. flava. Code and full documentation are available at the online repository of the project.

Re-analysis of A. japonica

RNA-seq reads of A. japonica from ref. 19 were mapped against the set of longest isoforms per gene of the currently available transcriptome using kallisto92. Read counts were imported to R using the tximport package. We used DESeq2 to identify temporally expressed genes using the negative binomial likelihood ratio test as explained above. We clustered the resulting temporally expressed genes in R using hierarchical clustering to retrieve clusters of temporal genes. Code and full documentation are available at the online repository of the project.

Re-analysis of B. floridae

RNA-seq reads of B. floridae from ref. 97 were mapped against the set of longest isoforms per gene of the currently available transcriptome using kallisto92. Read counts were imported to R using the tximport package. We used DESeq2 to identify temporally expressed genes using the negative binomial likelihood ratio test as explained above. We clustered the resulting temporally expressed genes in R using hierarchical clustering to retrieve clusters of temporal genes. Code and full documentation are available at the online repository of the project.

Re-analysis of O. fusiformis

We retrieved the published expression data and the temporal gene cluster information of O. fusiformis from the original repository of the reference publication6.

Orthology and gene age assignment

We ran OMA standalone98 using the set of predicted protein sequences of P. flava and a database of proteomes from 24 species (Supplementary Data 5), using standard parameters. The resulting orthogroups matrix was used to assign a gene age to each orthogroup (and therefore to every P. flava gene in a given orthogroup) based on Dollo parsimony using the Count software99. We also parsed the .orthoxml output file of hierarchical orthogroups to assign a gene family ID to every P. flava, B. floridae, B. lanceolatum, S. purpuratus, A. japonica and O. fusiformis gene. These gene id–gene family ID association tables were used in our comparative analyses (see later). As an alternative method for orthology inference, we ran OrthoFinder76,77 using standard parameters on the same set of proteomes as we did for OMA. We retrieved one-to-one orthologues from the results and also parsed the Orthogroups.tsv output file to assign a gene family ID to every P. flava, amphioxus and sea urchin gene. In addition, we also ran BLASTp best-reciprocal hits between all species of interest (see section on ‘Comparative analysis’ next, including P. flava and B. floridae) to retrieve one-to-one orthologues.

Comparative analyses

Using our OMA output, we retrieved all one-to-one orthologues between P. flava and S. purpuratus, between P. flava and A. japonica, between P. flava and B. lanceolatum, between P. flava and B. floridae, and between P. flava and O. fusiformis For a given pair of species, we used a set of custom wrappers we called comparABle (https://github.com/apposada/comparABle), which incorporate well-established methods in the literature (such as pairwise transcriptomic comparisons and orthogroup overlap; Extended Data Fig. 1; see later), to perform comparative analyses between these pairs of species. Alternatively, we also used the Orthofinder and BLASTp reciprocal best hits outputs to retrieve one-to-one orthologues between these species and run the comparative analysis. For the unified deuterostome comparisons, we used a list of 3,192 1:1:1 orthologues inferred to have been present in the deuterostome LCA in ref. 36. Code and full documentation are available at the online repository of the project.

Pairwise transcriptomic correlations

Using the whole temporal transcriptome datasets of each species, we subsetted these datasets to retrieve only common one-to-one orthologues and calculated distance and correlation metrics (Pearson correlation, Spearman correlation and Jensen–Shannon divergence) between pairs of transcriptomic stages. We calculated the average Jensen–Shannon divergence using bootstrapping of 1,000 iterations with replacement and also performed co-occurrence analysis86,87 using all genes and bootstrapping with replacement using the concatenate of the two datasets.

Identification of genes driving similarities between stages

Using the most closely matched pairs of stages according to the correlations described above, as well as selecting particularly interesting stage pairs, we identified key genes expressed during specific stage combinations across various species. Briefly, for a pair of stages a.i and b.j, we generated a design matrix to compare a.i and b.j against the rest of stages, performed linear modelling on the concatenate dataset of one-to-one orthologues using limma100 and retrieved the top genes with P < 0.05. Because this method finds genes specifically expressed in these two stages and not the others, it effectively retrieves a subset of the genes that drive similarities between stages a.i and b.j in a conservative manner, at the expense of other genes conspicuously expressed which might also be important for these similarities even if they are not captured by this method.

Gene family usage analysis

We compared all possible combinations of clusters of temporal genes between pairs of species by performing orthogroup overlap analysis6,49. For a given cluster in species ‘a’ a.x and a given cluster in species ‘b’ b.y, we counted the number of overlapped orthogroups between a.x and b.y (the number of genes from a given orthogroup in clusters a.x and b.y) and subjected these to upper-tail hypergeometric test. Using the information of the pairs of clusters sharing the highest amount of orthogroups, we retrieved the genes from P. flava belonging to those orthogroups and performed functional category analysis using the clusters of orthologous groups annotation derived from eggNOG (see earlier).

Comparative analyses using TFs

We generated reciprocal best hits of P. flava versus B. lanceolatum and P. flava versus S. purpuratus using BLASTp with standard parameters, and used these as proxy for one-to-one orthologues as an alternative method to OMA or OrthoFinder. Upon seeing similar trends regardless of using OMA or reciprocal best hits, we used reciprocal best hits to identify conserved TFs between the two pairs of species. We subsetted the expression datasets to keep the one-to-one TFs and performed the same distance metric correlations between pairs of species. Using the information on the most similar stages we proceeded with linear modelling to retrieve the TF genes driving similarities between stages as explained above.

Gene age enrichment and TAI analyses

We performed gene age enrichment analysis using contingency tables for every cluster and gene age. Results were considered significant when post hoc Chi-squared test P value was <0.05. We performed TAI analysis using the R package myTAI and the flatline test.

ATAC-seq analysis

ATAC-seq data analyses were performed as described in ref. 101, using standard pipelines48,49. Briefly, reads were aligned with Bowtie2 using a de novo sequenced, HiC-guided assembly of the genome of P. flava (deposited; ‘Data availability’). We generated a dataset of peaks using Macs2 (ref. 102) for peak calling and Bedtools103 to quantify the accessibility of each chromatin region. Differential analysis was performed using DESeq2 v.1.18.0 in R v.3.4.3 (ref. 88) using LRT. An adjusted P = 0.05 was set as a cutoff for statistical significance of the differential accessibility of the chromatin in ATAC-seq peaks. Motif enrichment was calculated using the program FindMotifsGenome.pl from the Homer tool suite93. The k-means clustering of the ATAC-seq signal was performed using R84 and visualized using ComplexHeatmap85. OCRs were assigned to the closest TSS using bedtools closestbed and standard parameters. Categorization of OCRs into distal, proximal, intronic and so on, as well as de novo motif finding and motif enrichment analysis, was done using the program annotatePeaks from the Homer tool suite93.

Cis-regulatory reporter assays

The ATAC-seq data revealed four potential regulatory elements (REs) locating within 3 kb upstream of the foxa coding sequence. PCR products containing 4, 3, 2 or 1 REs were cloned into the Hugo’s lamprey construct vector35,104 upstream of an egfp coding sequence and an SV40 polyadenylation sequence using In-Fusion Snap Assembly master mixes (Takara). To prepare the injection mixtures, 4 ng of reporter constructs (PCR products), 1.2 μl of 1 M KCl, 154 ng of carrier DNA (HindIII-digested S. purpuratus genomic DNA) and 5 μg of Dextran Alexa 555 were added into nuclease-free water (Invitrogen) to make 10 μl of mixture. About 4 pl of this mix was introduced into S. purpuratus zygotes and the EGFP signal was observed at blastula and gastrula stages.

Cis-regulatory networks

We computed two networks of cis-regulation (EB and LG) using ANANSE57. We initially generated a custom database of TF–domain association using GimmeMotifs motif2factors105, tailored for the genome annotation of P. flava using the proteomes and TF annotation of Homo sapiens and Danio rerio as reference, as well as the proteomes of several deuterostome species (including echinoderm and hemichordate) and the protostome O. fusiformis as outgroup to provide additional phylogenetic signal for the orthology. Next, we recentred the ATAC-seq OCR co-ordinates around the summit of the peak of chromatin accessibility using samtools mpileup to obtain a dataset of summit-centred peaks. For each developmental stage, we first ran ANANSE binding to create a genome-wide profile of TF binding across regions of open chromatin using the .bam alignment of the ATAC-seq and the set of summit-centred peaks and then ran ANANSE network to create a network of TFs and TGs using the binding profile and the RNA-seq expression data. The resulting network was imported to R and analysed using iGraph. We subsetted the network to retrieve edges of the top 5% (ANANSE score value, ‘prob’, above quantile 0.95) and subsequently computed network size, number of connections per TG and per TF class centrality using custom wrappers for iGraph functions in R. Germ layer genes were retrieved from in situ hybridization data on hemichordates in the scientific literature and assigned to the newest annotation using BLAST best-reciprocal hits (Supplementary Data 11). Genes from the endomesoderm/paraxial mesoderm kernels were retrieved from the scientific literature60,61,106,107 and used as query on our OMA orthology run, our Orthofinder orthology run or our set of BLAST reciprocal best hits against the newest annotation of P. flava a BLAST best-reciprocal hit against the newest annotation of P. flava. We then subsetted the postprocessed LG graph (used for all of our analysis, with the top 5% of interactions) querying for the respective orthologues of the aforementioned genes in P. flava. For each of the two queries to the LG graph we also computed a probability distribution of randomly finding as many genes connected as queried. This probability distribution was computed as 100,000 iterations of random subsampling of the postprocessed LG graph, taking into account both TFs and effector genes, as the latter are naturally less connected than TFs because of the directed nature of these graphs. Graphs were pruned for visualization using a custom R iGraph wrapper to keep the top three-to-five interactions per gene and visualized using the iGraph R package.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.