My PhD Research

Category: Science

Hello!

I wrote this article to explain my completed PhD research at Monash University in Melbourne, Australia. My PhD research focus was on Polycomb Repressive Complex 2 (PRC2), an essential epigenetic regulator implicated in development and cancer.

This article is aimed at a general technical audience. It begins with a summary of my PhD experience and the skills I developed. I then introduce foundational concepts in molecular biology and epigenetics before describing in more depth the research projects that comprised my doctoral work. After the most technical aspects of my work are discussed, I describe the "bigger picture" of how basic research into foundational biochemical processes can lead to breakthroughs in disease treatment. Finally, I detail some of the side passion projects in bioinformatics that helped me break down barriers to getting my work done in the laboratory.

What My PhD Was Like

My PhD was entirely research focussed, which means that I had no classes or exams and essentially worked full time in a research laboratory. My responsibilities were similar to those of a full-time job, with the additional requirement of producing a long written thesis at the end.

My day-to-day responsibilities

Planning and performing experiments
Analysing and interpreting experimental data
Reading scientific literature to stay current in the field
Troubleshooting failed experiments and refining protocols
Attending regular lab meetings to present data or discuss experiments and data
Writing and revising manuscripts for peer-reviewed journals
Coordinating with local and international collaborators on projects
Attending and presenting at scientific conferences (posters and talks)
Maintaining laboratory equipment, ordering reagents, and managing inventory
Supervising and training new students and other lab members in laboratory techniques
Writing a doctoral thesis (~80,000 words)

My thesis is posted on the Monash University website here, but may be under embargo: https://doi.org/10.26180/24061752

Skills and Capabilities

My doctoral research required managing complex, multi-year scientific projects from initial conceptualisation to final publication in scientific journals. Beyond my foundational work in quantitative biochemistry, much of what I did involved troubleshooting complex protocols, building data processing pipelines, and developing and applying analytical methodologies (including statistical analysis).

Biochemistry and Biophysics

Method development and adaptation (e.g. adapting cell-based crosslinking protocols for use with purified recombinant proteins)
Recombinant protein expression and purification (baculovirus/insect cell system, bacteria)
Insect cell culture for recombinant protein expression (Sf9/Hi5 cells, baculovirus infection, expression optimisation)
Purification protocol development and troubleshooting (multi-subunit complexes with difficult-to-incorporate components)
Multi-step chromatographic purification (amylose/Ni-NTA affinity, heparin ion exchange, size exclusion via FPLC)
PCR amplification (e.g. DNA templates for substrate reconstitution, cloning)
Cloning and site-directed mutagenesis (via Gibson assembly)
Nucleosome reconstitution (histone purification, octamer assembly, gradient salt dialysis)
in vitro RNA transcription (RNA substrate design and production)
Quantitative DNA- and RNA-binding assays (EMSA with fluorescein-labelled probes)
Fluorescence anisotropy binding assays
Enzyme activity assays (radioactive assays using ¹⁴C-labelled substrates, luminescence-based assays)

Proteomics and Data Analysis

Proteomics data processing (MaxQuant/Andromeda)
Densitometry and gel quantification (via ImageJ)
Cross-gel normalisation using serial dilution standards
Statistical testing (Student's t-test, ANOVA)
Quantitative analysis (curve fitting, K_d determination, Hill coefficient determination with GraphPad Prism)

Visualisation and Communication

Presentations
- Formal conference presentations 10–20 minutes long plus Q&A.
- Annual presentations to a thesis committee of professors explaining my project. This includes scientific background, research progress, explanation/justification of research direction and methods, etc. plus Q&A.
- Group presentations that are typically more casual for updating immediate lab group peers on research progress. Also an opportunity for troubleshooting issues. Typically a 30–45 minute presentation plus Q&A every few months.
Scientific writing (three peer-reviewed publications, two as co-first author)
Scientific figure design (Adobe Illustrator/Photoshop)
Poster presentations at conferences
Protein 3D structure visualisation (PyMOL)

Computational

Programming: R (created a Shiny web application, data processing scripts) and Python (sequence analysis tools)
Version control using Git and GitHub
Bioinformatics and R workshops and conferences (attended during PhD, foundational exposure)

Introduction to Molecular Biology for Non-Scientists

Much of molecular biology describes how information flows inside living things. The instructions for life are stored in DNA, a long macromolecule that acts like a vast library of instructions. Our cells make use of these DNA instructions by first copying them into an intermediate molecule called RNA in a process called transcription. The RNA can then be processed, and its sequence is read by molecular machines called ribosomes to build proteins in a process called translation. The ribosome does this by reading the instructions in the RNA and joining amino acids together in a chain in a specific order. When incorporated within a protein, amino acids are often referred to as residues. Once this chain is more than roughly 50–100 residues long, we refer to it as a protein; shorter chains are usually called peptides. This flow of information from DNA to RNA to protein, through transcription and translation, is central to the molecular basis of life and is referred to as the central dogma of molecular biology. A protein folds or assembles into a native, often dynamic, three-dimensional conformation that is largely determined by its amino acid sequence, in a process called protein folding.
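The two steps described above can be sketched in a few lines of Python. This is a simplified illustration only: the DNA sequence is made up and does not correspond to a real gene, the function names are hypothetical, and RNA processing, the template strand, and everything else a real cell does are ignored. The sketch treats the input as the coding strand, so the mRNA is the same sequence with U in place of T.

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# The DNA sequence used below is a made-up example, not a real gene.

BASES = "TCAG"
# Standard genetic code listed in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG start codon until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is translated
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


dna = "GGCATGGCTAAAGAATGGTAACC"
mrna = transcribe(dna)
peptide = translate(mrna)
print("mRNA:   ", mrna)
print("Peptide:", peptide)
```

Each letter in the printed peptide is one residue, written in the same one-letter code used later in this article (for example, K for lysine).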

Figure 1: The central dogma of molecular biology. Genetic information in protein-coding genes flows from DNA to RNA to protein. During transcription, a protein enzyme called RNA polymerase uses DNA as a template to produce messenger RNA (mRNA). During translation, biomolecules called ribosomes read the mRNA sequence and assemble amino acids into a polypeptide chain, which can fold into a functional protein. The bottom row shows three common ways of representing protein structure, using ubiquitin as an example (PDB ID: 1UBQ). A surface representation emphasises the overall molecular shape. A ribbon representation highlights secondary structure and folding. The stick representation shows atoms and chemical bonds in greater detail, highlighting the complexity. The N-terminus is highlighted in orange, and the C-terminus is highlighted in magenta.

Proteins make up between 9% and 20% of the adult human body by weight and are responsible for many essential tasks in our bodies, such as carrying oxygen in our blood, digesting food, or forming structures such as muscle fibres or hair. Proteins that increase the rate of biochemical reactions are called enzymes. Enzymes are often highly specific to their target, called a substrate, and this specificity is often described using the 'lock and key' model.

Category	Primary Function	Key Examples
Enzymatic (Catalytic)	Acts as a biological catalyst to speed up chemical reactions.	- DNA Polymerase: Synthesises DNA - Amylase: Breaks down starch into sugar in our saliva - ATP Synthase: Synthesises ATP, a common form of energy storage in the body
Structural	Provides mechanical support, shape, and stability to cells, tissues, and organs.	- Collagen: A major protein in connective tissue. Present in skin, ligaments, tendons, bones, and cartilage - Keratin: Fibrous protein which makes up body parts such as hair, nails, horns, scales, feathers, etc. - Actin: Provides cells and muscles with structure - Tubulin: Provides cells with shape and structure
Transport	Moves molecules, ions, or nutrients throughout the body or across cellular membranes.	- Haemoglobin: Transports oxygen in the blood - Aquaporins: Transport water across cell membranes - Serum Albumin: The most abundant blood protein in mammals. Helps maintain the osmotic pressure of blood, and transports steroids, fatty acids, and thyroid hormones
Regulatory and signalling	Facilitates cellular communication and coordinates complex physiological processes.	- Insulin: Promotes glucose absorption. Dysfunction is implicated in diabetes - G-Protein Coupled Receptors (GPCRs): Detect molecules outside the cell to activate cellular responses. Many different kinds exist - Transcription Factors: Control the rate of genetic transcription, allowing genes to be active or repressed
Defence (Immunological)	Identifies, neutralises, and destroys foreign pathogens such as bacteria and viruses.	- Immunoglobulins (Antibodies): Used by the immune system to fight disease-causing bacteria and viruses - Complement Proteins: Enhance the ability of antibodies and immune cells to clear microbes and damaged cells from an organism, promote inflammation, and attack the membranes of invading cells
Motor and contractile	Converts chemical energy into mechanical work to enable movement.	- Myosin: Involved in muscle contraction - Kinesin & Dynein: Involved in transporting cargo around the inside of cells
Storage	Serves as biological reserves for essential nutrients, metal ions, or amino acids.	- Ferritin: Stores iron and releases it in a controlled way - Myoglobin: Stores oxygen in cardiac and skeletal muscle tissue

There are estimated to be roughly 20,000 protein-coding genes in human DNA (Amaral et al. 2023, PMID: 37794265). This is only one dimension of the complexity that gives rise to life. For example, for many genes, the RNA transcript can be processed and re-arranged in various ways prior to translation in a process called alternative splicing. The different RNA transcripts produced from the same gene can then be used to make different proteins. After proteins are created, they can be further modified by other proteins in our cells through post-translational modifications. Only certain proteins undergo post-translational modifications, and typically only at certain residues. In some cases, such as the protein insulin, proteins are cleaved (cut) into shorter forms in order to activate, modify, or end their function. Based on this, we can start to appreciate the level of complexity that exists at the molecular level of biological systems.

There are many types of post-translational modifications (including glycosylation, phosphorylation, and others), but three are particularly relevant to my research:

Methylation - The transfer of methyl groups from S-adenosyl methionine (SAM) to certain protein residues and to DNA itself. Methylation is essential in switching certain genes between an active and inactive state (potentially in either direction depending on the specific modification).
Acetylation - Attachment of an acetyl group to certain protein residues. If attached to a lysine residue, it neutralises that residue's characteristic positive charge, altering how it interacts with other molecules. Acetylation is usually involved with promoting certain genes into an active state.
Ubiquitylation - Attachment of a protein called ubiquitin to a lysine residue. Mono-ubiquitylation (attachment of one ubiquitin) can regulate protein localisation within the cell and its function. Poly-ubiquitylation (attachment of many ubiquitin molecules) marks the protein substrate for degradation and recycling. This typically happens to damaged, misfolded, or obsolete proteins.

Epigenetics: Regulation of Gene Expression and Cell Identity

The vast majority of cells in our body share the same DNA sequence, yet we have an incredibly wide variety of specialised cell types. All of our organs and tissues must behave differently and create different products from the same genome. Throughout our development, the necessity for particular genes in our DNA to be active or switched off can change. The field which concerns the way that genes in our DNA can be switched on and off to control this specialisation and differentiation of function and identity is called epigenetics.

To understand how genes are repressed and derepressed, it is necessary to first understand how DNA is stored. Almost all of the estimated 30–40 trillion cells in our bodies contain about 2 metres of DNA (exceptions include red blood cells, platelets, and others). It's estimated that all of the DNA in the adult human body, if stretched end to end, could extend from Earth to Neptune and back 10 times (Francis Crick Institute video clip). The DNA is stored in a compartment of the cell called the nucleus, which is only a few micrometres wide (a few thousandths of a millimetre). To fit inside, the DNA is wrapped around proteins called histones. Histone proteins bind together into an octamer (8-piece) arrangement, and a histone octamer with DNA wrapped around it is called a nucleosome. The histone octamer is made up of two copies each of histone H2A, histone H2B, histone H3, and histone H4. These histones have positively charged amino acids on their surface which can bind DNA (which is negatively charged). This allows the DNA and histones to form a close association and a stable structural arrangement.
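As a rough back-of-the-envelope check on the scale involved, the figures quoted above can be multiplied out. The Earth–Neptune distance used here (about 4.5 billion km) is an assumed round average that does not come from this article, and counting every cell ignores the cell types that carry no nuclear DNA, so the result should be read as an order-of-magnitude estimate only.

```python
# Order-of-magnitude estimate of the total DNA length in an adult human body.
DNA_PER_CELL_M = 2.0                 # metres of DNA per cell (from the text)
CELL_COUNT_RANGE = (30e12, 40e12)    # 30-40 trillion cells (from the text)
# Assumption, not from the article: approximate average Earth-Neptune distance.
EARTH_NEPTUNE_KM = 4.5e9


def total_dna_km(cell_count: float) -> float:
    """Total DNA length in kilometres if every cell holds DNA_PER_CELL_M."""
    return cell_count * DNA_PER_CELL_M / 1000.0


def round_trips(cell_count: float) -> float:
    """Number of Earth-Neptune-Earth round trips the DNA would span."""
    return total_dna_km(cell_count) / (2 * EARTH_NEPTUNE_KM)


for cells in CELL_COUNT_RANGE:
    print(
        f"{cells:.0e} cells: {total_dna_km(cells):.1e} km of DNA, "
        f"about {round_trips(cells):.1f} round trips to Neptune"
    )
```

Under these assumptions the answer lands at several round trips, the same order of magnitude as the figure quoted above.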

This arrangement of DNA and nucleosomes is called chromatin, and can be compared in appearance to yarn wound into balls, except that these balls are linked together and can pack tightly or loosen depending on the gene and how it is being regulated. When chromatin is tightly packed, it forms heterochromatin, which is inaccessible to the cellular machinery that reads and transcribes genes into RNA. When chromatin is loosely packed and accessible, it is called euchromatin. Euchromatin is considered active, and factors such as RNA polymerase II can access its genes to transcribe RNA, which may then be used to create functional proteins for specific tasks.

Figure 2: “Epigenetic mechanisms”, produced by the National Institutes of Health, public domain, via Wikimedia Commons. Illustrates how DNA is packaged from chromosomes into chromatin, where DNA wraps around histone proteins. Epigenetic marks can alter gene expression without changing the underlying DNA sequence. DNA methylation involves the addition of methyl groups to DNA, and can repress or activate genes depending on genomic context. Histone modifications occur when chemical groups bind to histone tails, changing how tightly DNA is wrapped around histone proteins. Tightly packed chromatin (heterochromatin) can make DNA less accessible and genes inactive, whereas more open chromatin (euchromatin) can make DNA accessible and genes active. Chromatin organises itself into distinct separate bodies called chromosomes, which further organise the genes into specific loci.

Whether a gene is repressed or active can depend on the type of cell, the point of development of an organism, and how the histone proteins are modified. This is where the post-translational modifications introduced earlier become directly relevant. Histones have long, flexible 'tails' that extend outward from the nucleosome core, and these tails can be modified by the same types of chemical marks described earlier (acetylation, methylation, ubiquitylation), each with specific consequences for gene activity. For example, acetylation of histone H3 at lysine 27 (written as H3K27ac, where K is the one-letter symbol for lysine) neutralises the positive charge that holds DNA tightly to the histone, loosening the chromatin, and making the associated gene more accessible and active. These modifications can also recruit other factors that further activate or repress genes, creating layered regulatory networks.

Ubiquitylation plays a distinct role on histones, acting as a signal to other regulatory proteins. One key example is H2AK119ub1 (histone H2A, lysine 119, modified with a single ubiquitin), which is deposited by Polycomb Repressive Complex 1 (PRC1) and acts to repress gene expression at target loci (positions in the genome). Finally, histone methylation marks act as molecular labels recognised by other factors to either activate or repress genes. One example, H3K36me3 (tri-methylated lysine 36 of histone H3), is deposited by the enzyme SETD2 and is associated with active chromatin. Another modification called H3K27me3 is associated with gene repression, and was a primary focus of my PhD research.
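The naming scheme for histone marks is compact enough to be read mechanically. The sketch below, using hypothetical function and class names, splits a label such as H3K27me3 into its parts: the histone, the one-letter residue code, the position, the modification, and how many copies are attached. It only covers the modification types discussed in this article and is meant as a reading aid, not a complete parser for the nomenclature.

```python
import re
from typing import NamedTuple

# Only the modification types discussed in this article are included.
MODIFICATIONS = {"me": "methylation", "ac": "acetylation", "ub": "ubiquitylation"}
RESIDUES = {"K": "lysine"}  # the marks discussed here are all on lysines


class HistoneMark(NamedTuple):
    histone: str        # e.g. "H3"
    residue: str        # one-letter amino acid code, e.g. "K"
    position: int       # residue number within the histone
    modification: str   # "me", "ac" or "ub"
    count: int          # number of groups attached (1 if not written)


MARK_PATTERN = re.compile(r"^(H2A|H2B|H3|H4)([A-Z])(\d+)(me|ac|ub)(\d*)$")


def parse_mark(label: str) -> HistoneMark:
    """Split a label such as 'H3K27me3' into its components."""
    match = MARK_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised histone mark: {label!r}")
    histone, residue, position, modification, count = match.groups()
    return HistoneMark(
        histone, residue, int(position), modification, int(count) if count else 1
    )


def describe(mark: HistoneMark) -> str:
    """Return a plain-English description of a parsed mark."""
    residue = RESIDUES.get(mark.residue, mark.residue)
    modification = MODIFICATIONS[mark.modification]
    return f"histone {mark.histone}, {residue} {mark.position}, {modification} x{mark.count}"


for label in ("H3K27ac", "H3K27me3", "H3K36me3", "H2AK119ub1"):
    print(label, "->", describe(parse_mark(label)))
```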

There is much we don't know about how biomolecules such as proteins, DNA, and RNA behave and interact to orchestrate epigenetic regulation. The networks that describe how these biomolecules interact to influence and fine-tune the function of a healthy, properly differentiated cell can be vast and complex. For example, see Fig. 1 from this 2016 paper by Hauri et al. on the Polycomb group of epigenetic protein complexes and their interactions with other Polycomb and non-Polycomb group proteins. Determining which proteins interact with each other, or which proteins are associated with each other in the same regulatory networks, remains a significant challenge in many cases. Ascertaining the mechanisms and purpose of these interactions is more challenging still, and incremental discoveries continue to slowly build our understanding of these epigenetic control networks.

PRC2 is an Epigenetic Regulator That Maintains Gene Repression at Target Sites

My PhD research was primarily focussed on Polycomb Repressive Complex 2 (PRC2). PRC2 is a protein complex: a structure composed of multiple proteins that are translated separately and then assembled together. Each protein in a protein complex is called a subunit. As in many protein complexes, each subunit in PRC2 has a specialised function which contributes to the overall function and activity of the complex.

PRC2 is an epigenetic regulator that maintains gene repression at specific target loci. It stops specific target genes from being active and expressed at specific stages of development and differentiation. For example, genes essential for cell growth and division may be important at some times, but when they are not needed, they must be "switched off" (i.e. repressed). One role of PRC2 is to maintain these genes in a repressed state when they are not needed. PRC2 is the only known enzyme responsible for depositing H3K27me3 in humans (and all vertebrates). It is a methyltransferase enzyme: it takes a methyl group from a non-protein molecule called S-adenosyl methionine (SAM, known in biochemistry as a cofactor or methyl donor) and attaches it to the target substrate. More specifically, PRC2 is known as a histone methyltransferase (HMTase), as it tri-methylates lysine 27 of histone H3 (H3K27) to form H3K27me3.

H3K27me3 and PRC2 are considered essential, in that human life is not viable without them. If PRC2 loses function due to a mutation or other cause in a disease state, these growth and division genes can initiate uncontrolled cell growth and division, potentially resulting in cancerous tumours. Genes that drive cancer when PRC2 fails to repress them in this way are called oncogenes. In some disease states where PRC2 is implicated, PRC2 can also errantly maintain repression at tumour suppressor genes, which are genes that act to stop the formation of cancerous tumours by slowing down cell division, repairing DNA errors, or initiating programmed cell death (apoptosis) when appropriate.

Mutations that make components of PRC2 overactive can be found in certain lymphomas and sarcomas. PRC2 dysregulation is also central to diffuse intrinsic pontine glioma (DIPG), an aggressive childhood brain cancer in which a single mutation in histone H3 disrupts PRC2's ability to methylate chromatin normally. DIPG has a median survival of less than a year after diagnosis, and there is currently no effective treatment. Understanding how PRC2 is regulated and how that regulation breaks down may inform the development of therapies for these diseases.

Outside the context of early development, mutations in PRC2 subunits or related proteins are known to cause disease. These mutations may arise after development, and they can lead to conditions such as cancer and developmental syndromes. One prominent example is the H3K27M oncohistone (a cancer-driving mutation in a histone). In the H3K27M oncohistone, lysine 27 of histone H3 is replaced by methionine, and it appears in ~78% of DIPG cases. It was initially proposed that H3K27M causes a drastic loss of H3K27me3 due to strong interactions between PRC2 and H3K27M; however, this has been challenged by more recent studies. In disease states like this, understanding the mechanisms governing how PRC2 and closely related proteins like histone H3 are regulated can be invaluable for the development of targeted therapeutics.
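To show how small the K27M change is at the sequence level, the sketch below applies the substitution to the first 40 residues of the histone H3 tail. The sequence is the canonical human H3.1 N-terminal tail, numbered without the initiator methionine as is conventional for histones; the helper function is a hypothetical illustration and not part of any analysis described in this article.

```python
# First 40 residues of the human histone H3.1 tail, numbered from the
# residue after the initiator methionine (the usual histone convention).
H3_TAIL = "ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHR"


def substitute(sequence: str, mutation: str) -> str:
    """Apply a point substitution written like 'K27M' (1-based position)."""
    original, position, replacement = mutation[0], int(mutation[1:-1]), mutation[-1]
    index = position - 1
    if sequence[index] != original:
        raise ValueError(
            f"Expected {original} at position {position}, found {sequence[index]}"
        )
    return sequence[:index] + replacement + sequence[index + 1:]


mutant = substitute(H3_TAIL, "K27M")
print("Wild type:", H3_TAIL)
print("K27M:     ", mutant)
print("Residue 27:", H3_TAIL[26], "->", mutant[26])
```

A single letter changes, but because lysine 27 is the residue that PRC2 methylates, that one change removes the site where H3K27me3 would otherwise be deposited on the mutant histone.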

Mechanisms of PRC2 Recruitment: RNA, DNA, and Accessory Protein Interactions

PRC2 is a complex composed of four core protein subunits: EZH2 (or EZH1), EED, SUZ12, and RBBP4 (or RBBP7). EZH2 gives PRC2 its catalytic activity, doing the direct work of depositing H3K27me3. EED is a regulatory subunit; certain factors can bind to EED to alter PRC2 activity. SUZ12 is a scaffold protein that provides structural rigidity and stability to the assembled complex. RBBP4 is involved in histone binding.

Although core PRC2 (referred to as PRC2 or PRC2-4m) has catalytic activity, this is not the form in which we find PRC2 in vivo. PRC2 also has what are known as accessory subunits: subunits which bind to PRC2 in addition to its four core subunits to regulate PRC2 activity. When PRC2 is assembled with accessory subunits to form an active functional complex, we can refer to it as holo-PRC2 (see holoproteins). PRC2 has been shown to form two distinct complexes, PRC2.1 and PRC2.2, depending on which accessory subunits are bound. Relevant to my thesis, PRC2 has been shown to have RNA-binding activity, and a number of models have emerged to explain the effect of RNA binding on the regulation of PRC2. PRC2 also demonstrates DNA-binding activity, which has been shown to be important for essential functions such as nucleosome attachment and H3K27me3 deposition.

Figure 3: PRC2 catalyses formation of the H3K27me3 histone mark associated with repressed chromatin. Core PRC2 is composed of the subunits EZH1/2, SUZ12, EED and RBBP4/7. PRC2 is also regulated by other factors, including the PRC2.1 accessory subunits PHF1, PHF19, MTF2, EPOP, and PALI1, and the PRC2.2 accessory subunits AEBP2 and JARID2. Nucleic acids such as DNA and RNA are also able to bind and regulate PRC2. However, much is unknown about how these factors regulate PRC2 mechanistically.

The PRC2.2 complex is defined by the presence of two accessory subunits: AEBP2 and JARID2. PRC2 can bind with just AEBP2 to form PRC2–AEBP2, or it can bind with both AEBP2 and JARID2 to form PRC2–AEBP2–JARID2. Both AEBP2 and JARID2 are DNA-binding proteins, and their presence changes how PRC2 interacts with chromatin. JARID2 facilitates crosstalk between PRC2 and PRC1 by recognising the H2AK119ub1 mark deposited by PRC1. This is one of the key ways in which PRC1 and PRC2 cooperate to establish and maintain gene silencing. JARID2 also binds DNA directly, which may help PRC2.2 engage chromatin independently of pre-existing histone marks.

Figure 4: PRC2 can form either the PRC2.1 or PRC2.2 complex depending on which accessory subunits are bound to the core complex. Subunits are shown approximately to scale relative to each other.

Figure 5: Diagram of the identified and predicted protein domains within the protein subunits of the PRC2 complex. EZH2 is the catalytic subunit. SUZ12 forms a scaffold which the subunits use to bind and assemble. EED is the regulatory subunit to which regulatory factors bind to regulate PRC2. RBBP4 is an additional core subunit that has a role in nucleosome binding. AEBP2 and JARID2 are essential accessory subunits that regulate PRC2 activity and recruitment and form part of the PRC2.2 complex. AEBP2 interacts with PRC2 via its C-terminus. JARID2 interacts with PRC2 via its N-terminus. PHF1, PHF19, MTF2, EPOP, and PALI1 are accessory subunits of the PRC2.1 complex. The PCL proteins (PHF1, PHF19, and MTF2) interact with PRC2 to form PRC2.1 via the N-terminal chromo-like (CL) domains. PRC2.1 accessory subunits EPOP and PALI1 interact with PRC2 via their C-terminal region (CTR) and PALI interaction with PRC2 (PIP) regions, respectively. Previously identified domains are marked in white. UniProt codes for each protein sequence are shown next to the protein name.

Much remains unclear regarding how PRC2 activity and recruitment are regulated: What effect does RNA binding have on PRC2 regulation? Are PRC2.1 and PRC2.2 differentially regulated throughout development? What mechanisms underlie PRC2 occupation of target genes, and what causes PRC2 to ultimately methylate histones at these genetic loci? My PhD research aimed to address these questions.

in vitro and in vivo Approaches Complement Each Other

Much of the work I conducted during my PhD involved what are referred to as in vitro assays (Latin for "in glass"). in vitro experiments are those performed outside of a living organism, typically in highly controlled environments like test tubes or flasks. The primary advantage of this approach is the precise control over experimental variables, such as reagent concentrations and solution buffers. This allows researchers to isolate specific biochemical interactions and molecular mechanisms without the confounding factors present in a complete biological system. Living systems contain countless variables that cannot be fully accounted for in an experimental design, introducing noise that can make interpreting direct mechanisms difficult.

Conversely, in vivo experiments (Latin for "in the living") describe observations and measurements taking place within a living cell or organism. Because in vivo experiments occur in their native biological context, they are essential for determining how a process operates within a complete, functioning physiological system.

in vitro and in vivo approaches might at first seem to be at odds with each other, but they are actually complementary. Each approach dictates the type of experiments you can perform and the specifics of what you can discover. In both of the co-first author papers I published during my PhD, I contributed in vitro data, while our collaborators contributed in vivo (in cell culture) data. This combined approach provided a much more complete picture of the biological mechanisms at play. in vitro assays quantified and compared binding interactions and reaction rates in ways that are impossible within a living cell. Our collaborators' in vivo data were then used to confirm those observations in a biological context and yield additional insights that isolated systems cannot provide. Used together, the two approaches make the findings more robust.

Modification of an in vivo Method to Reveal the RNA-Binding Site at the PRC2 Regulatory Centre

This work involved an international collaboration between teams at Monash University (Australia) and the University of Pennsylvania (USA) and was published in Nature Structural & Molecular Biology with me as an equal-contribution co-first author (Zhang et al. 2019). This text focusses specifically on the work that I performed myself unless otherwise noted.

PRC2 activity is partly controlled by an allosteric regulatory site: a region on the surface of the complex, away from the active site, where binding of certain molecules can alter PRC2 activity. In PRC2, the active site is located within a domain of EZH2 called the SET domain, where H3K27 binds and is methylated. The regulatory site is located within EED and at the interface between EED and EZH2. One known factor that increases PRC2 activity via the regulatory site is the histone mark H3K27me3. When H3K27me3 binds to the allosteric regulatory site in EED, PRC2 catalytic activity increases, stimulating PRC2 to methylate neighbouring nucleosomes. This feed-forward loop helps spread gene silencing along chromatin and maintain the presence of H3K27me3. RNA had long been known to inhibit PRC2, but exactly where on PRC2 it bound was unclear.

To identify the exact residues that make up the RNA-binding site on PRC2, I modified an in vivo method called "RBDmap" that was developed in another study. RBDmap is a cross-linking and mass spectrometry protocol that was originally developed to identify the RNA-binding sites of proteins in cell cultures. Because the original in vivo method had been unsuccessful in finding an RNA-binding site within PRC2, I adapted RBDmap into an in vitro methodology optimised for purified recombinant proteins and synthetic RNA that I produced by in vitro transcription. This required the production of core PRC2 and PRC2–AEBP2 complexes, which I expressed using the baculovirus/insect cell system and isolated via multi-step chromatography. By using purified complexes and synthetic RNA, I was able to bypass potential issues inherent to whole-cell methods, allowing for the reliable capture of cross-linked peptides from PRC2 that were bound to RNA. This work yielded the first residue-level map of RNA contacts on PRC2, localising the primary binding site to the regulatory centre at the EED–EZH2 interface.

With the purified PRC2 complexes, I also conducted electrophoretic mobility shift assays (EMSA) to confirm direct RNA binding and performed radioactive HMTase assays using ¹⁴C-labelled SAM to assess the effect of RNA on the methyltransferase function of PRC2. These assays used multiple substrate types (including reconstituted nucleosomes, free H3 histones, and non-histone substrates) to determine whether RNA inhibited PRC2 solely by competing for DNA binding (free H3 histones and non-histone substrates contain no DNA).

These biochemical assays demonstrated that RNA inhibits both the PRC2.1 and PRC2.2 complexes through a mechanism that is entirely independent of competition for DNA binding. Crucially, we established that RNA and allosteric activators, such as H3K27me3 and JARID2-K116me3, bind to the same regulatory site to regulate the complex antagonistically. This antagonistic model can explain how the presence of stimulatory peptides at specific target genes allows PRC2 to overcome RNA-mediated inhibition and maintain targeted gene repression in the RNA-rich environment of the cell nucleus.

Auto-Inhibition of Polycomb Repressive Complex 2 by the AEBP2 Long Isoform

This work involved an international collaboration between teams at Monash University (Australia) and Trinity College Dublin (Ireland), and was published in The EMBO Journal with me as an equal-contribution co-first author (Mucha et al. 2025). This text focusses specifically on the work that I performed myself unless otherwise noted.

PRC2 relies on accessory subunits that tune its behaviour. One such subunit, AEBP2, exists as two isoforms generated from the same gene: a longer version (AEBP2^L) found mainly in somatic (body) cells, and a shorter version (AEBP2^S) enriched in embryonic stem cells. In the literature, experiments using AEBP2 focus on either the short or the long isoform, and refer to the isoform used as just "AEBP2" with no long or short designation. No one had yet focussed on differentiating between the two isoforms. This represents a problem: if the two isoforms of AEBP2 have opposing functions, the conclusion drawn from a study may be substantially different depending on the isoform used. Therefore, I expressed and purified PRC2 in complex with the long and short AEBP2 isoforms separately (creating purified PRC2–AEBP2^L and PRC2–AEBP2^S complexes), and performed assays to determine differences in DNA-binding and HMTase activity.

I found that the two main isoforms of AEBP2 exert opposing effects on PRC2 activity. Using EMSAs, fluorescence anisotropy (a technique using polarised light to quantify protein-DNA binding affinity in solution), and MTase-Glo methyltransferase activity assays, I determined that the AEBP2^S isoform promotes the DNA-binding ability of PRC2 and that the PRC2–AEBP2^S complex exhibits robust enzymatic activity. Conversely, I found that AEBP2^L, which features long tracts of negatively charged (acidic) residues within its extended N-terminus, strongly inhibits DNA binding and nearly abolishes PRC2 catalytic activity on chromatin substrates.

To explain the structural basis of this inhibition, we hypothesised that the negatively charged acidic tracts within the disordered N-terminus of AEBP2^L actively repel DNA, thereby obstructing PRC2 binding via electrostatic repulsion. To test this, I computationally designed a series of mutant AEBP2^L constructs in which these acidic tracts were either deleted, mutated to all-alanine (neutral) tracts, or mutated to all-lysine (positively charged) tracts. Other members of the laboratory subsequently cloned, expressed, and purified these mutant PRC2–AEBP2^L complexes. They then evaluated these mutant complexes using fluorescence anisotropy and 14C-autoradiography HMTase assays. These assays confirmed that the acidic tracts in the N-terminus of AEBP2^L are directly responsible for the auto-inhibitory effect on PRC2 activity.
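To give a feel for what designing such mutants can look like in code, here is a minimal Python sketch. It is an illustration only, not the script I actually used: the toy sequence is made up (it is not AEBP2), the function names are hypothetical, and the minimum tract length of five residues is an arbitrary assumption.

```python
import re

def find_acidic_tracts(seq, min_len=5):
    """Return (start, end) spans of consecutive D/E residues of at least min_len.

    min_len is an assumed threshold chosen for illustration.
    """
    return [m.span() for m in re.finditer(r"[DE]{%d,}" % min_len, seq)]

def mutate_tracts(seq, tracts, replacement=None):
    """Rebuild seq with each tract deleted (replacement=None) or swapped for a
    same-length run of a single residue such as "A" or "K"."""
    pieces = []
    prev_end = 0
    for start, end in tracts:
        pieces.append(seq[prev_end:start])          # keep the untouched stretch
        if replacement is not None:
            pieces.append(replacement * (end - start))
        prev_end = end
    pieces.append(seq[prev_end:])                   # keep the tail after the last tract
    return "".join(pieces)

# Made-up toy sequence with two acidic tracts (NOT the real AEBP2 sequence).
toy = "MKTAEEDDEEDLSGRPDEEEEEDDKQ"
tracts = find_acidic_tracts(toy)

deletion_mutant = mutate_tracts(toy, tracts)        # tracts removed
alanine_mutant = mutate_tracts(toy, tracts, "A")    # neutral tracts
lysine_mutant = mutate_tracts(toy, tracts, "K")     # positively charged tracts

for name, variant in [("deletion", deletion_mutant),
                      ("alanine", alanine_mutant),
                      ("lysine", lysine_mutant)]:
    print(name, variant)
```

The alanine and lysine variants keep the original protein length, so any change in behaviour can be attributed to charge rather than to the spacing of the surrounding residues, whereas the deletion variant removes the tracts altogether.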

To determine if this auto-inhibitory behaviour held true inside living cells, our collaborators engineered isoform-specific CRISPR-Cas9 knockouts in mouse embryonic stem cells (mESCs). Their in vivo experiments showed that AEBP2^S actively promotes PRC2's recruitment to target genes, which is essential for initiating gene repression during early embryonic differentiation. CRISPR-Cas9 also made it possible to isolate the biological effect of AEBP2^L by knocking out AEBP2^S. In doing so, our collaborators found that, in contrast with AEBP2^S, AEBP2^L actively antagonises PRC2. Conversely, when they knocked out AEBP2^L and isolated AEBP2^S, they observed an abnormal increase in PRC2 binding and H3K27me3 deposition at target genes. Our collaborators' in vivo data and my in vitro biochemical assay data were highly complementary.

Given that AEBP2^L is an isoform exclusively conserved in vertebrate species, we suggested in the published paper that the long isoform may have evolved as a restraint mechanism, keeping PRC2 in check in differentiated cells where inappropriate gene silencing could be harmful.

Overcoming Purification Challenges to Reveal PRC2–JARID2 Activation Dynamics

This work involved an international collaboration between teams at Yokohama City University and RIKEN in Japan, and Monash University (Australia). It was published in the Journal of Molecular Biology with me as third author (Ohtomo et al. 2023). This text focusses specifically on the work that I myself performed unless otherwise noted.

The PRC2–AEBP2–JARID2 complex is a type of PRC2.2 complex which includes core PRC2 and both accessory subunits AEBP2 and JARID2. PRC2–AEBP2–JARID2 is thought to be the version of PRC2 responsible for establishing new domains of gene silencing. Purifying the full six-subunit complex with full-length JARID2 proved to be a considerable technical challenge during my PhD. JARID2 is a large 165 kDa protein with extensive disordered regions that complicate recombinant expression and purification. When I first started trying to purify a PRC2–AEBP2–JARID2 complex, only a couple of other research groups worldwide had published assays showing PRC2 in complex with full-length JARID2. Many studies use truncated JARID2, possibly because full-length JARID2 is not needed for the purpose of a particular study, or because of the technical difficulties involved.

Our research group was initially using its protocol for expressing and purifying PRC2–AEBP2 (without JARID2) to purify the PRC2–AEBP2–JARID2 complex as well. We found this protocol was always highly reliable for PRC2–AEBP2, but very inconsistent for PRC2–AEBP2–JARID2. Using this PRC2–AEBP2 protocol for PRC2–AEBP2–JARID2, I frequently found that JARID2 did not appear on SDS-PAGE gels after lysing the cells. I made considerable efforts troubleshooting and modifying both our laboratory's baculovirus protein expression and PRC2 complex purification protocols to determine the cause. Taking into account the large size and largely disordered structure of JARID2, I hypothesised that JARID2 may be particularly susceptible to proteolysis by proteases present in the cell lysate (cell contents after lysis), despite the addition of a protease inhibitor. After narrowing the most likely cause down to proteolysis in the lysis buffer, I started testing different lysis buffer conditions side by side. After testing many conditions, I achieved good stoichiometric incorporation of full-length JARID2 as part of an assembled and highly pure PRC2–AEBP2–JARID2 complex through a number of changes to the lysis and affinity column wash buffers, namely adding 15% v/v glycerol and increasing the pH from 7.5 to 8.0.

Beyond resolving the proteolysis issue, I also optimised the expression system to improve efficiency and reduce costs. Rather than adopting the Spodoptera frugiperda (Sf9) cell lines used by the few other groups successful with full-length JARID2, I deliberately adapted the protocol for our existing Trichoplusia ni (High Five) insect cells. While Sf9 cells are the industry standard for initial viral stock amplification, High Five cells are typically better suited for high-yield bulk protein manufacturing. This decision also maintained consistency across our laboratory's purification pipelines, as we use High Five cells for all of our other PRC2 expression. Because of the low yield of PRC2–AEBP2–JARID2, I would typically grow a large volume (12 litres) of cell culture before inducing protein expression, which requires a lot of growth media. Given that the growth media used for High Five cells (Insect Xpress) is cheaper than the growth media for Sf9 cells (Sf-900 II/III SFM), this also introduced a considerable cost reduction for future expression of this complex. Other than these crucial modifications, expression and purification of PRC2–AEBP2–JARID2 could then be performed similarly to PRC2–AEBP2. Having this complete complex in hand for the first time in our lab enabled direct biochemical comparisons using the same HMTase and binding assays used in my other projects.

Shortly after I was able to purify PRC2–AEBP2–JARID2, our lab was contacted by a research group at RIKEN in Japan, who had heard through word of mouth that our group had recently managed to purify a high-quality PRC2–AEBP2–JARID2 complex with full-length JARID2. We then discussed and set up a collaboration with them, in which my new PRC2–AEBP2–JARID2 expression and purification protocol would be essential for performing assays to complement data that they were collecting.

I was then responsible for producing the required PRC2 protein complexes, assembling the nucleosome substrates, and executing the carbon-14 HMTase assays. To isolate the variables driving PRC2 activity, I assembled unmodified nucleosomes and a labmate modified a separate batch of them with the H2AK119ub1 mark. Modified and unmodified nucleosomes either had no linker DNA (NCPs, or "nucleosome core particles", featuring 143 base pair long DNA) or featured linker DNA extending beyond the nucleosome core, which could potentially interact with PRC2 (referred to in the paper as "nucleosomes", featuring 193 base pair long DNA).

The central finding from my functional assays was that linker DNA stimulated PRC2.2 catalytic activity to a degree comparable to H2AK119ub1. Historically, the H2AK119ub1 mark was considered one of the primary recruitment and activation signals for PRC2. However, these assays pointed to JARID2's DNA-binding activity as an equally crucial, underappreciated driver of PRC2 function. This biochemical baseline helps explain how PRC2.2 can gain a foothold at new genomic targets to initiate gene silencing even before repressive histone marks have been established (de novo repression).

While my in vitro assays show that linker DNA and H2A ubiquitylation work synergistically to massively enhance PRC2 activity, an important question remains: how does this happen at a structural level? To answer this, our collaborators used nuclear magnetic resonance (NMR) spectroscopy to measure the physical movements of the proteins in solution. They found that in unmodified nucleosomes, the long, flexible H3 histone tail dynamically binds to the linker DNA, essentially hiding the H3K27 target site from the PRC2 complex. However, when PRC1 tags H2A to form H2AK119ub1, this mark physically displaces the H3 histone tail from the linker DNA. This shift in dynamics frees the H3 histone tail and poises it to be fed into the active site of the PRC2–AEBP2–JARID2 complex. By combining my biochemical functional data with their biophysical structural data, we were able to provide a unified model explaining how the cross-talk between PRC1, ubiquitin marks, and linker DNA structurally regulates gene silencing.

Basic Research: How Expanding the Boundaries of Human Knowledge Allows for Targeting of Complex Diseases

All of the research that I undertook during my PhD can be defined as "basic research". Basic research is the pursuit of fundamental scientific knowledge without an immediate commercial or practical application as its primary aim. While many may want to prioritise projects with immediate and predictable outcomes, this viewpoint fundamentally misunderstands the pipeline of innovation. The translation of science into applied breakthroughs requires a pre-existing reservoir of knowledge. For example, the mathematical algorithms that underpin modern Wi-Fi technology were initially developed by radio astronomers at the CSIRO attempting to detect signals from exploding mini black holes. Sometime in 2021, I read a social media post about a PhD candidate who had defended their research on Coronaviridae in 2019. They told a story about getting mercilessly grilled by a member of their thesis committee on the practical relevance and impact of studying an obscure family of viruses. This foundation of "basic research" was seen as not important at all, until suddenly it was the most important thing in the world. Knowledge fundamentally drives future innovation, making basic research as essential as it is frequently underappreciated. Funding and supporting basic research in balance with applied research is vital for successful outcomes in both.

Understanding how proteins like PRC2 work at a mechanistic level can directly inform the development of targeted therapies. Each mechanism disrupted in a disease state represents a potential drug target, but developing medicines that correct these problems requires foundational biochemical knowledge. Modern therapeutics span many approaches (small molecule inhibitors, monoclonal antibodies, nucleic acid therapies, mRNA-based treatments, and others), and the common thread is that each depends on the kind of foundational research described in this post.

For PRC2, this basic research has already translated into approved drugs, and a number of therapeutics are currently being used for diseases where PRC2 function is affected. The first generation of approved PRC2 therapeutics focussed on small molecules that competitively bind the SAM pocket of the catalytic subunits to halt methylation.

Tazemetostat: A highly selective, SAM-competitive EZH2 inhibitor. It is the first FDA-approved PRC2-targeted therapy, utilised clinically for metastatic or locally advanced epithelioid sarcoma and relapsed/refractory follicular lymphoma, particularly in cases driven by activating EZH2 gain-of-function mutations (such as Y641).
Valemetostat: A dual inhibitor targeting both EZH1 and EZH2. It was developed to overcome the paralog compensation effect where EZH1 is upregulated to maintain H3K27me3 levels when EZH2 is exclusively blocked. It has received regulatory approvals for specific hematological malignancies, including relapsed/refractory adult T-cell leukemia/lymphoma (ATL) and peripheral T-cell lymphoma (PTCL).

These inhibitors are great medical breakthroughs. So now that we have them, why go through the effort of finding out more about mechanisms of PRC2 function? Unfortunately, evolved drug resistance and systemic toxicity can limit their effectiveness. Cancers can be highly mutable, meaning they can escape the inhibitory effect of the drugs by mutating. Additionally, small molecule inhibitors can have very broad effects: PRC2 activity may be inhibited too strongly throughout the body, and the drugs can induce off-target toxicity and side effects. Drug modalities other than small molecule inhibitors may be able to fine-tune PRC2 activity rather than simply inhibit it. How much we know about the mechanisms of PRC2 regulation will inform the types of therapeutics we can develop and what they can target. Finally, a broad range of therapeutics targeting many different mechanisms of PRC2 function can provide multiple lines of defence and give patients a much better prognosis.

Computational and Bioinformatic Approaches in My PhD

Alongside my bench work, I found myself repeatedly hitting bottlenecks in data processing and sequence design that didn't have ready-made solutions. I had already been learning R and Python as a hobby for some time prior to my PhD. So, rather than work around these issues, I was eager to use R and Python to build my own tools. I found that I enjoyed the problem-solving involved in programming at least as much as the wet lab experiments themselves. Both tools described below are available on my GitHub.

RBDmap Lite: A basic R Shiny web application I finished building in 2019 to process my RBDmap mass spectrometry data. It takes LC-MS/MS output files processed through MaxQuant and maps them against a protein FASTA file to identify RNA-binding sites at near-residue resolution. The app generates per-subunit hit plots and a colour-coded protein sequence showing the number of crosslinking hits at each residue. I deployed a live version at n-j-mckenzie.shinyapps.io/RBDmap_Lite, so the analysis can be run directly in a browser without any local installation.
DNA Complexity Filter: A Python tool I wrote in 2020 to optimise codon usage when designing synthetic DNA sequences for gene synthesis. Vendors such as IDT (gBlocks) and GenScript impose strict constraints on GC content and repetitive sequences, and designing compliant sequences by hand was tedious. The script randomly generates codon combinations encoding the desired protein sequence and filters them against user-specified thresholds for overall GC content, local GC content across sliding windows, and maximum repeat length. This saved considerable time when I needed to design mutant constructs for the AEBP2 project.
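The generate-and-filter idea behind DNA Complexity Filter can be sketched in a few lines of Python. This is a simplified illustration rather than the tool's actual code: the function names, the default thresholds, and the definition of a "repeat" (any stretch of bases longer than max_repeat that occurs more than once) are all assumptions made for the example, and real vendor rules are more detailed.

```python
import random

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)

def translate(dna):
    """Translate an in-frame DNA string back into a protein string."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))

def gc_fraction(dna):
    return (dna.count("G") + dna.count("C")) / len(dna)

def windows_ok(dna, window, low, high):
    """True if every sliding window has a GC fraction within [low, high]."""
    return all(low <= gc_fraction(dna[i:i + window]) <= high
               for i in range(len(dna) - window + 1))

def has_repeat(dna, k):
    """True if any k-base stretch occurs more than once in the sequence."""
    seen = set()
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False

def design(protein, gc_range=(0.40, 0.60), window=20,
           window_gc_range=(0.25, 0.75), max_repeat=8,
           tries=10000, seed=None):
    """Randomly sample codon choices until one passes every filter.

    All default thresholds are illustrative assumptions, not vendor rules.
    Returns a DNA string, or None if no candidate passed within `tries`.
    """
    rng = random.Random(seed)
    for _ in range(tries):
        dna = "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)
        if not gc_range[0] <= gc_fraction(dna) <= gc_range[1]:
            continue
        if not windows_ok(dna, window, *window_gc_range):
            continue
        if has_repeat(dna, max_repeat + 1):
            continue
        return dna
    return None

# Made-up acidic-rich toy peptide (not a real construct).
candidate = design("MKTAEEDDEEDLSGRP", seed=1)
print(candidate)
```

Random sampling is crude compared with a proper optimiser, but it is easy to reason about: every returned sequence encodes exactly the requested protein, and tightening a threshold simply means more candidates are rejected before one is accepted.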