
mXSS: The Vulnerability Hiding in Your Code
Cross-site scripting (XSS) is a well-known vulnerability class that occurs when an attacker can inject JavaScript code into a vulnerable page. When an unsuspecting victim visits the page, the injected code executes in the victim's session. The impact varies by application, ranging from no business impact to account takeover (ATO), data leakage, or even remote code execution (RCE).
There are various types of XSS, such as reflected, stored, and universal. But in recent years, the mutation class of XSS has become notorious for bypassing sanitizers such as DOMPurify, Mozilla Bleach, and Google Caja, affecting numerous applications, including Google Search. To this day, we see many applications that are susceptible to these kinds of attacks.
But what is mXSS?
(We also explored this topic in our Insomnihack 2024 talk: Beating The Sanitizer: Why You Should Add mXSS To Your Toolbox.)
Background
If you are a web developer, you have probably integrated or even implemented some kind of sanitization to protect your application from XSS attacks. But few appreciate how difficult it is to build a proper HTML sanitizer. The goal of an HTML sanitizer is to ensure that user-generated content, such as text input or data obtained from external sources, does not pose any security risks or disrupt the intended functionality of a website or application.
One of the main challenges in implementing an HTML sanitizer lies in the complex nature of HTML itself. HTML is a versatile language with a wide range of elements, attributes, and potential combinations that can affect the structure and behavior of a webpage. Parsing and analyzing HTML code accurately while preserving its intended functionality can be a daunting task.
HTML
Before getting into the subject of mXSS, let's first have a look at HTML, the markup language that forms the foundation of web pages. Understanding HTML's structure and how it works is crucial since mXSS (mutation Cross-Site Scripting) attacks utilize quirks and intricacies of HTML.
HTML is considered a tolerant language because of its forgiving nature when it encounters errors or unexpected code. Unlike some stricter programming languages, HTML prioritizes displaying content even if the code isn't perfectly written. Here's how this tolerance plays out:
When broken markup is rendered, instead of crashing or displaying an error message, browsers attempt to interpret and fix the HTML as best they can, even if it contains minor syntax errors or missing elements. For instance, opening the markup `<p>test` in the browser will render as expected despite the missing closing `p` tag. Looking at the final page's HTML code, we can see that the parser fixed our broken markup and closed the `p` element by itself: `<p>test</p>`.
Why it's Tolerant:
- Accessibility: The web should be accessible to everyone, and minor errors in HTML shouldn't prevent users from seeing the content. Tolerance allows for a wider range of users and developers to interact with the web.
- Flexibility: HTML is often used by people with varying levels of coding experience. Tolerance allows for some sloppiness or mistakes without completely breaking the page's functionality.
- Backward Compatibility: The web is constantly evolving, but many existing websites are built with older HTML standards. Tolerance ensures that these older sites can still be displayed in modern browsers, even if they don't adhere to the latest specifications.
But how does our HTML parser know in which way to "fix" broken markup? Should `<a><b>` become `<a></a><b></b>` or `<a><b></b></a>`?
To answer this question there is a well-documented HTML specification, but unfortunately, there are still some ambiguities that result in different HTML parsing behaviors even between major browsers today.
Mutation
OK, so HTML can tolerate broken markup. But how is this relevant?
The M in mXSS stands for “mutation”, and mutation in HTML is any kind of change made to the markup for some reason or another.
- When a parser fixes broken markup (`<p>test` → `<p>test</p>`), that's a mutation.
- Normalizing attribute quotes (`<a alt=test>` → `<a alt="test">`), that's a mutation.
- Rearranging elements (`<table><a>` → `<a></a><table></table>`), that's a mutation.
- And so on…
mXSS takes advantage of this behavior to bypass sanitization; we will showcase examples in the technical details below.
HTML Parsing Background
Summarizing HTML parsing, a roughly 1,500-page standard, into one section is not realistic. However, because it is essential for understanding mXSS in depth and how payloads work, we must cover at least some major topics. To make things easier, we've developed an mXSS cheat sheet (coming later in this blog) that condenses the hefty standard into a more manageable resource for researchers and developers.
Different content parsing types
HTML isn't a one-size-fits-all parsing environment. Elements handle their content differently, with seven distinct parsing modes at play. We'll explore these modes to understand how they influence mXSS vulnerabilities:
- Void elements: `area`, `base`, `br`, `col`, `embed`, `hr`, `img`, `input`, `link`, `meta`, `source`, `track`, `wbr`
- The `template` element
- Raw text elements: `script`, `style`, `noscript`, `xmp`, `iframe`, `noembed`, `noframes`
- Escapable raw text elements: `textarea`, `title`
- Foreign content elements: `svg`, `math`
- The plaintext state: `plaintext`
- Normal elements: all other allowed HTML elements
We can fairly easily demonstrate a difference between parsing types using the following example:
- Our first input is a `div` element, which is a "normal" element: `<div><a alt="</div><img src=x onerror=alert(1)>">`
- On the other hand, the second input is similar markup using the `style` element instead (which is a "raw text" element): `<style><a alt="</style><img src=x onerror=alert(1)>">`
Looking at the parsed markup, we can clearly see the parsing differences: the content of the `div` element is parsed as HTML, so an `a` element is created. What looks like a closing `div` tag followed by an `img` tag is actually the attribute value of the `a` element, and is therefore rendered as `alt` text for the `a` element rather than as HTML markup. In the `style` example, the content of the `style` element is parsed as raw text, so no `a` element is created, and the alleged attribute value is now treated as normal HTML markup.
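The same two parsing modes can be observed outside the browser. Python's standard-library `html.parser` is only a tokenizer, not a spec-compliant tree builder, but it also treats `style` content as raw text, which is enough to reproduce the difference. The `TagCollector` helper below is our own illustration, not part of any sanitizer:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records every start tag the tokenizer emits."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags

# Normal element: the quoted alt attribute swallows "</div><img ...>",
# so no img element is ever created.
print(start_tags('<div><a alt="</div><img src=x onerror=alert(1)>">'))
# ['div', 'a']

# Raw text element: the a tag is plain text inside style, but the
# would-be attribute value is real markup, so an img element appears.
print(start_tags('<style><a alt="</style><img src=x onerror=alert(1)>">'))
# ['style', 'img']
```

The exact same byte sequence yields an `a` element in one context and an `img` element in the other, purely because of the enclosing element's parsing mode.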
Foreign content elements
HTML5 introduced new ways to integrate specialized content within web pages. Two key examples are the `<svg>` and `<math>` elements. These elements live in distinct namespaces, meaning they follow different parsing rules than standard HTML. Understanding these different parsing rules is crucial for mitigating potential security risks associated with mXSS attacks.
Let's take a look at the same example as before, but this time encapsulated inside an `svg` element:

`<svg><style><a alt="</style><img src=x onerror=alert(1)>">`

In this case, we do see an `a` element being created. The `style` element doesn't follow the "raw text" parsing rules because it is inside a different namespace: when residing within an SVG or MathML namespace, the parsing rules change and no longer follow the HTML language.
Using namespace-confusion techniques (such as the DOMPurify 2.0.0 bypass), attackers can manipulate the sanitizer into parsing content differently from how the browser will eventually render it, evading detection of malicious elements.
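Any parser that is not namespace-aware will disagree with the browser here. Python's `html.parser`, for instance, always treats `style` as raw text, even inside `<svg>`, so it "sees" elements where a spec-compliant browser only sees an attribute value. This is a sketch of the mismatch, not how DOMPurify itself works:

```python
from html.parser import HTMLParser

class NamespaceUnawareView(HTMLParser):
    """A namespace-unaware view of the markup: style is ALWAYS raw text,
    exactly as in the HTML (non-SVG) namespace."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

payload = '<svg><style><a alt="</style><img src=x onerror=alert(1)>">'

view = NamespaceUnawareView()
view.feed(payload)
view.close()

# The naive parser closes style at the "</style>" that sits inside the
# alt text, then tokenizes a real img start tag:
print(view.tags)  # ['svg', 'style', 'img']
# A spec-compliant browser, parsing style in the SVG namespace as normal
# content, instead builds an <a> element whose alt attribute merely
# *contains* the img text. The two sides disagree about what is markup
# and what is data, and that gap is exactly what the attacker abuses.
```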
From Mutations to Vulnerabilities
Oftentimes the term "mXSS" is used broadly when covering various sanitizer bypasses. For better understanding, we will split the general term into four different subcategories.
Parser differentials
Though a parser differential can be described as an ordinary sanitizer bypass, it is sometimes referred to as mXSS. Either way, an attacker can take advantage of a parsing mismatch between the sanitizer's algorithm and the renderer's (e.g., the browser's). Due to the complexity of HTML parsing, a parsing differential doesn't necessarily mean that one parser is wrong while the other is right.
Take, for example, the `noscript` element, whose parsing rule is: “If the scripting flag is enabled, switch the tokenizer to the RAWTEXT state. Otherwise, leave the tokenizer in the data state.” (link) Meaning that, depending on whether JavaScript is disabled or enabled, the body of the `noscript` element is parsed differently. It is logical that JavaScript would be disabled at the sanitizer stage but enabled in the renderer. This behavior is not wrong by definition, but it can lead to bypasses such as: `<noscript><style></noscript><img src=x onerror="alert(1)">`
With JS disabled, the `style` element swallows everything after it as raw text, so no `img` element is ever created. With JS enabled, the raw-text `noscript` content ends at `</noscript>`, and the `img` element with its `onerror` handler becomes live markup.
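We can emulate both sides of this differential with Python's `html.parser`. By default it treats only `script` and `style` as raw text, which matches the scripting-disabled parse; extending its internal `CDATA_CONTENT_ELEMENTS` tuple to include `noscript` approximates the scripting-enabled parse. This is an illustration that relies on a CPython implementation detail, not a full spec implementation:

```python
from html.parser import HTMLParser

class ScriptingDisabled(HTMLParser):
    """Default behavior: noscript content is parsed as normal markup,
    so the style element then swallows the rest as raw text."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

class ScriptingEnabled(ScriptingDisabled):
    """Emulation: also treat noscript as raw text (RAWTEXT state),
    as a browser does when JavaScript is enabled."""
    CDATA_CONTENT_ELEMENTS = HTMLParser.CDATA_CONTENT_ELEMENTS + ("noscript",)

payload = '<noscript><style></noscript><img src=x onerror=alert(1)>'

for parser_cls in (ScriptingDisabled, ScriptingEnabled):
    p = parser_cls()
    p.feed(payload)
    p.close()
    print(parser_cls.__name__, p.tags)

# ScriptingDisabled ['noscript', 'style']  -> img is inert style text
# ScriptingEnabled  ['noscript', 'img']    -> img is a live element
```

A sanitizer that parses like `ScriptingDisabled` sees a harmless `style` body; the browser, parsing like `ScriptingEnabled`, creates the `img` element and fires its handler.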
Many other parser differentials could occur as well, such as those caused by different HTML versions, content-type mismatches, and more.
Parsing round trip
Parsing round trip is a well-known and documented phenomenon, which the HTML specification describes as follows: “It is possible that the output of this algorithm, if parsed with an HTML parser, will not return the original tree structure. Tree structures that do not roundtrip a serialize and reparse step can also be produced by the HTML parser itself, although such cases are typically non-conforming.”
In other words, the resulting DOM tree can change depending on how many times the same HTML markup is parsed.
Let's take a look at the official example provided in the specification:
But first, we need to understand that a `form` element cannot have another `form` element nested inside it: “Content model: Flow content, but with no form element descendants.” (as written in the spec) Yet if we keep reading the documentation, it gives an example of how `form` elements can nevertheless end up nested, using the following markup: