HTML API: Scan all syntax tokens in a document, read modifiable text.

Since its introduction in WordPress 6.2 the HTML Tag Processor has provided a way to scan through all of the HTML tags in a document and then read and modify their attributes. In order to do this reliably, it also needed to be aware of other kinds of HTML syntax, but it didn't expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a few helper methods to read information about or from each token. Most significantly, this introduces the ability to read `#text` nodes in the document.

What's new in the Tag Processor?
================================

- `next_token()` visits every distinct syntax token in a document.
- `get_token_type()` indicates what kind of token it is.
- `get_token_name()` returns something akin to `DOMNode.nodeName`.
- `get_modifiable_text()` returns the text associated with a token.
- `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

{{{
<?php
function strip_all_tags( $html ) {
	$text_content = '';
	$processor    = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_token() ) {
		if ( '#text' !== $processor->get_token_type() ) {
			continue;
		}

		$text_content .= $processor->get_modifiable_text();
	}

	return $text_content;
}
}}}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of every HTML element separately. Now, however, there are special tags which it only visits once, as if those elements were void tags without a closer. These are special tags because their content contains no other HTML or markup, only non-HTML content.

- SCRIPT elements contain raw text which is isolated from the rest of the HTML document and fed separately into a JavaScript engine. There are complicated rules to avoid escaping the script context in the HTML. The contents are left verbatim, and character references are not decoded.
- TEXTAREA and TITLE elements contain plain text which is decoded before display, e.g. transforming `&amp;` into `&`. Any markup which resembles tags is treated as verbatim text and not a tag.
- IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the TEXTAREA and TITLE elements, but no character references are decoded. For example, `&amp;` inside a STYLE element is passed to the CSS engine as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the elements containing it, the Tag Processor combines them when scanning into a single match and makes their content available as modifiable text (see below).

This means that the Tag Processor will no longer visit a closing tag for any of these elements unless that tag is unexpected.

{{{
<title>There is only a single token in this line</title>
<title>There are two tokens in this line></title></title>
</title><title>There are still two tokens in this line></title>
}}}

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in HTML. There are only a few kinds of tokens in HTML:

- a tag has a name, attributes, and a closing or self-closing flag.
- a text node, or `#text` node, contains plain text which is displayed in a browser and which is decoded before display.
- a DOCTYPE declaration indicates how to parse the document.
- a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will recognize, some of which don't exist as concepts in HTML. These mostly comprise XML syntax elements that aren't part of HTML (such as CDATA and processing instructions) and invalid HTML syntax that transforms into comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way. A closing tag with an invalid name is considered a "funky comment." In the browser these become HTML comments just like any other, but their syntax is convenient for representing a variety of bits of information in a well-defined way that cannot be nested or recursive, given the parsing rules for this invalid syntax.

- `</1>`
- `</%avatar_url>`
- `</{"wp_bit": {"type": "post-author"}}>`
- `</[post-author]>`
- `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content inside a funky comment is easy to parse: the only rule is that it starts at the `<` and continues until the nearest `>`. There can be no funky comment inside another, because that would imply having a `>` inside of one, which would actually terminate the first one.

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node. It represents the span of text for a given token which may be modified without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where text nodes are found. Only the special tags mentioned above have modifiable text.

{{{
<div class="post">Another day in HTML</div>
└─ tag ──────────┘└─ text node ─────┘└────┴─ tag
}}}

{{{
<title>Is <img> &gt; <image>?</title>
│      └ modifiable text ───┘       │ "Is <img> > <image>?"
└─ tag ─────────────────────────────┘
}}}

Text nodes
==========

Text nodes are entirely modifiable text.

{{{
This HTML document has no tags.
└─ modifiable text ───────────┘
}}}

Comments
========

The modifiable text inside a comment is the portion of the comment that doesn't form its syntax. This also applies to a number of invalid comments.

{{{
│ └─ modifiable text ──────┘ │
└─ comment token ───────────────┘
}}}

{{{
│ └─ modifiable text ────────┘ │
└─ comment token ───────────────┘
}}}

{{{
<![CDATA[this is an invalid comment]]>
│        └─ modifiable text ──────┘  │
└─ comment token ────────────────────┘
}}}

Other token types also have modifiable text. Consult the code or tests for further information.

Developed in https://github.com/WordPress/wordpress-develop/pull/5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows [57575]

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes #60170

git-svn-id: https://develop.svn.wordpress.org/trunk@57348 602fd350-edb4-49c9-b593-d223f7449a82
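
The funky-comment rule described in the message above (a closer whose name is invalid, running from the `<` to the nearest `>`) can be illustrated with a minimal standalone sketch. This is not the Tag Processor's implementation: the function name `sketch_find_funky_comments` is hypothetical, and the sketch assumes the input contains no SCRIPT contents, regular comments, or attribute values in which `</` would not start a token. It returns what the message calls the modifiable text, matching the `</%url>` test case where the text is `%url`.

```php
<?php
/**
 * Hypothetical helper for illustration only; not part of WordPress.
 *
 * Collects the text between `</` and the nearest `>` for every closer
 * whose name does not start with an ASCII letter. The empty closer `</>`
 * is skipped, since the tests treat it as a separate token type.
 */
function sketch_find_funky_comments( string $html ): array {
	$found  = array();
	$offset = 0;
	$length = strlen( $html );

	while ( $offset < $length ) {
		$start = strpos( $html, '</', $offset );
		if ( false === $start ) {
			break;
		}

		$name_at = $start + 2;
		if ( $name_at >= $length ) {
			break; // Input ends right after `</`; nothing to report.
		}

		$byte     = ord( $html[ $name_at ] );
		$is_alpha = ( $byte >= 0x41 && $byte <= 0x5A ) || ( $byte >= 0x61 && $byte <= 0x7A );

		// A letter starts a real tag closer; `>` means the empty closer `</>`.
		if ( $is_alpha || '>' === $html[ $name_at ] ) {
			$offset = $name_at;
			continue;
		}

		$end = strpos( $html, '>', $name_at );
		if ( false === $end ) {
			break; // Unterminated; treat as incomplete input.
		}

		$found[] = substr( $html, $name_at, $end - $name_at );
		$offset  = $end + 1;
	}

	return $found;
}
```

Because the scan always stops at the nearest `>`, one funky comment can never contain another, which is the non-nesting property the message relies on.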
2026-02-24 09:42:45 +00:00 · 2024-01-24 23:35:46 +00:00 · 2024-01-24 23:35:46 +00:00 · 616e673d3e
commit 616e673d3e
parent 6daf853022
5 changed files with 1581 additions and 149 deletions
--- a/src/wp-includes/html-api/class-wp-html-processor.php
+++ b/src/wp-includes/html-api/class-wp-html-processor.php
@@ -149,17 +149,6 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 	 */
 	const MAX_BOOKMARKS = 100;

-	/**
-	 * Static query for instructing the Tag Processor to visit every token.
-	 *
-	 * @access private
-	 *
-	 * @since 6.4.0
-	 *
-	 * @var array
-	 */
-	const VISIT_EVERYTHING = array( 'tag_closers' => 'visit' );
-
 	/**
 	 * Holds the working state of the parser, including the stack of
 	 * open elements and the stack of active formatting elements.
@@ -424,6 +413,30 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 		return false;
 	}

+	/**
+	 * Ensures internal accounting is maintained for HTML semantic rules while
+	 * the underlying Tag Processor class is seeking to a bookmark.
+	 *
+	 * This doesn't currently have a way to represent non-tags and doesn't process
+	 * semantic rules for text nodes. For access to the raw tokens consider using
+	 * WP_HTML_Tag_Processor instead.
+	 *
+	 * @since 6.5.0 Added for internal support; do not use.
+	 *
+	 * @access private
+	 *
+	 * @return bool
+	 */
+	public function next_token() {
+		$found_a_token = parent::next_token();
+
+		if ( '#tag' === $this->get_token_type() ) {
+			$this->step( self::REPROCESS_CURRENT_NODE );
+		}
+
+		return $found_a_token;
+	}
+
 	/**
 	 * Indicates if the currently-matched tag matches the given breadcrumbs.
 	 *
@@ -520,7 +533,9 @@ class WP_HTML_Processor extends WP_HTML_Tag_Processor {
 				$this->state->stack_of_open_elements->pop();
 			}

-			parent::next_tag( self::VISIT_EVERYTHING );
+			while ( parent::next_token() && '#tag' !== $this->get_token_type() ) {
+				continue;
+			}
 		}

 		// Finish stepping when there are no more tokens in the document.
--- a/src/wp-includes/html-api/class-wp-html-tag-processor.php
+++ b/src/wp-includes/html-api/class-wp-html-tag-processor.php
--- a/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php
+++ b/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php
@@ -514,7 +514,11 @@ class Tests_HtmlApi_WpHtmlProcessorBreadcrumbs extends WP_UnitTestCase {
 	 * @covers WP_HTML_Processor::seek
 	 */
 	public function test_can_seek_back_and_forth() {
-		$p = WP_HTML_Processor::create_fragment( '<div><p one><div><p><div two><p><div><p><div><p three>' );
+		$p = WP_HTML_Processor::create_fragment(
+			<<<'HTML'
+<div>text<p one>more stuff<div><![CDATA[this is not real CDATA]]><p><!-- hi --><div two><p><div><p>three comes soon<div><p three>' );
+HTML
+		);

 		// Find first tag of interest.
 		while ( $p->next_tag() && null === $p->get_attribute( 'one' ) ) {
--- a/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php
+++ b/tests/phpunit/tests/html-api/wpHtmlTagProcessor-token-scanning.php
@@ -0,0 +1,734 @@
+<?php
+/**
+ * Unit tests covering WP_HTML_Tag_Processor token-scanning functionality.
+ *
+ * @package WordPress
+ * @subpackage HTML-API
+ *
+ * @since 6.5.0
+ *
+ * @group html-api
+ *
+ * @coversDefaultClass WP_HTML_Tag_Processor
+ */
+class Tests_HtmlApi_WpHtmlProcessor_Token_Scanning extends WP_UnitTestCase {
+	/**
+	 * Ensures that scanning finishes in a complete form when the document is empty.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_completes_empty_document() {
+		$processor = new WP_HTML_Tag_Processor( '' );
+
+		$this->assertFalse(
+			$processor->next_token(),
+			"Should not have found any tokens but found {$processor->get_token_type()}."
+		);
+	}
+
+	/**
+	 * Ensures that normative text nodes are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_text_node() {
+		$processor = new WP_HTML_Tag_Processor( 'Hello, World!' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#text',
+			$processor->get_token_type(),
+			"Should have found #text token type but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'Hello, World!',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative Elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_element() {
+		$processor = new WP_HTML_Tag_Processor( '<div id="test" inert>Hello, World!</div>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'DIV',
+			$processor->get_token_name(),
+			"Should have found DIV tag name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			'test',
+			$processor->get_attribute( 'id' ),
+			"Should have found id attribute value 'test' but found {$processor->get_attribute( 'id' )} instead."
+		);
+
+		$this->assertTrue(
+			$processor->get_attribute( 'inert' ),
+			"Should have found boolean attribute 'inert' but didn't."
+		);
+
+		$attributes     = $processor->get_attribute_names_with_prefix( '' );
+		$attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+		$this->assertSame(
+			array( 'id', 'inert' ),
+			$attributes,
+			'Should have found only two attributes but found ' . implode( ', ', $attribute_list ) . ' instead.'
+		);
+
+		$this->assertSame(
+			'',
+			$processor->get_modifiable_text(),
+			"Should have found empty modifiable text but found '{$processor->get_modifiable_text()}' instead."
+		);
+	}
+
+	/**
+	 * Ensures that normative SCRIPT elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_script_element() {
+		$processor = new WP_HTML_Tag_Processor( '<script type="module">console.log( "Hello, World!" );</script>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'SCRIPT',
+			$processor->get_token_name(),
+			"Should have found SCRIPT tag name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			'module',
+			$processor->get_attribute( 'type' ),
+			"Should have found type attribute value 'module' but found {$processor->get_attribute( 'type' )} instead."
+		);
+
+		$attributes     = $processor->get_attribute_names_with_prefix( '' );
+		$attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+		$this->assertSame(
+			array( 'type' ),
+			$attributes,
+			"Should have found single 'type' attribute but found " . implode( ', ', $attribute_list ) . ' instead.'
+		);
+
+		$this->assertSame(
+			'console.log( "Hello, World!" );',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative TEXTAREA elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_textarea_element() {
+		$processor = new WP_HTML_Tag_Processor(
+			<<<HTML
+<textarea rows=30 cols="80">
+Is <HTML> &gt; XHTML?
+</textarea>
+HTML
+		);
+		$processor->next_token();
+
+		$this->assertSame(
+			'TEXTAREA',
+			$processor->get_token_name(),
+			"Should have found TEXTAREA tag name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			'30',
+			$processor->get_attribute( 'rows' ),
+			"Should have found rows attribute value '30' but found {$processor->get_attribute( 'rows' )} instead."
+		);
+
+		$this->assertSame(
+			'80',
+			$processor->get_attribute( 'cols' ),
+			"Should have found cols attribute value '80' but found {$processor->get_attribute( 'cols' )} instead."
+		);
+
+		$attributes     = $processor->get_attribute_names_with_prefix( '' );
+		$attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+		$this->assertSame(
+			array( 'rows', 'cols' ),
+			$attributes,
+			'Should have found only two attributes but found ' . implode( ', ', $attribute_list ) . ' instead.'
+		);
+
+		// Note that the leading newline should be removed from the TEXTAREA contents.
+		$this->assertSame(
+			"Is <HTML> > XHTML?\n",
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative TITLE elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_title_element() {
+		$processor = new WP_HTML_Tag_Processor(
+			<<<HTML
+<title class="multi-line-title">
+Is <HTML> &gt; XHTML?
+</title>
+HTML
+		);
+		$processor->next_token();
+
+		$this->assertSame(
+			'TITLE',
+			$processor->get_token_name(),
+			"Should have found TITLE tag name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			'multi-line-title',
+			$processor->get_attribute( 'class' ),
+			"Should have found class attribute value 'multi-line-title' but found {$processor->get_attribute( 'class' )} instead."
+		);
+
+		$attributes     = $processor->get_attribute_names_with_prefix( '' );
+		$attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+		$this->assertSame(
+			array( 'class' ),
+			$attributes,
+			'Should have found only one attribute but found ' . implode( ', ', $attribute_list ) . ' instead.'
+		);
+
+		$this->assertSame(
+			"\nIs <HTML> > XHTML?\n",
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative RAWTEXT elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 *
+	 * @dataProvider data_rawtext_elements
+	 *
+	 * @param string $tag_name The name of the RAWTEXT tag to test.
+	 */
+	public function test_basic_assertion_rawtext_elements( $tag_name ) {
+		$processor = new WP_HTML_Tag_Processor(
+			<<<HTML
+<{$tag_name} class="multi-line-title">
+Is <HTML> &gt; XHTML?
+</{$tag_name}>
+HTML
+		);
+		$processor->next_token();
+
+		$this->assertSame(
+			$tag_name,
+			$processor->get_token_name(),
+			"Should have found {$tag_name} tag name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			'multi-line-title',
+			$processor->get_attribute( 'class' ),
+			"Should have found class attribute value 'multi-line-title' but found {$processor->get_attribute( 'class' )} instead."
+		);
+
+		$attributes     = $processor->get_attribute_names_with_prefix( '' );
+		$attribute_list = array_map( 'Tests_HtmlApi_WpHtmlProcessor_Token_Scanning::quoted', $attributes );
+		$this->assertSame(
+			array( 'class' ),
+			$attributes,
+			'Should have found only one attribute but found ' . implode( ', ', $attribute_list ) . ' instead.'
+		);
+
+		$this->assertSame(
+			"\nIs <HTML> &gt; XHTML?\n",
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Data provider.
+	 *
+	 * @return array[].
+	 */
+	public function data_rawtext_elements() {
+		return array(
+			'IFRAME'   => array( 'IFRAME' ),
+			'NOEMBED'  => array( 'NOEMBED' ),
+			'NOFRAMES' => array( 'NOFRAMES' ),
+			'STYLE'    => array( 'STYLE' ),
+			'XMP'      => array( 'XMP' ),
+		);
+	}
+
+	/**
+	 * Ensures that normative CDATA sections are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_cdata_section() {
+		$processor = new WP_HTML_Tag_Processor( '<![CDATA[this is a comment]]>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found comment token but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			WP_HTML_Processor::COMMENT_AS_CDATA_LOOKALIKE,
+			$processor->get_comment_type(),
+			'Should have detected a CDATA-like invalid comment.'
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			'this is a comment',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that abruptly-closed CDATA sections are properly parsed as comments.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_abruptly_closed_cdata_section() {
+		$processor = new WP_HTML_Tag_Processor( '<![CDATA[this is > a comment]]>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found a bogus comment but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			'[CDATA[this is ',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+
+		$processor->next_token();
+
+		$this->assertSame(
+			'#text',
+			$processor->get_token_name(),
+			"Should have found text node but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			' a comment]]>',
+			$processor->get_modifiable_text(),
+			'Should have found remaining syntax from abruptly-closed CDATA section.'
+		);
+	}
+
+	/**
+	 * Ensures that normative Processing Instruction nodes are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_processing_instruction() {
+		$processor = new WP_HTML_Tag_Processor( '<?wp-bit {"just": "kidding"}?>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found comment token but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertSame(
+			WP_HTML_Processor::COMMENT_AS_PI_NODE_LOOKALIKE,
+			$processor->get_comment_type(),
+			'Should have detected a Processing Instruction-like invalid comment.'
+		);
+
+		$this->assertSame(
+			'wp-bit',
+			$processor->get_tag(),
+			"Should have found PI target as tag name but found {$processor->get_tag()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			' {"just": "kidding"}',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that abruptly-closed Processing Instruction nodes are properly parsed as comments.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_abruptly_closed_processing_instruction() {
+		$processor = new WP_HTML_Tag_Processor( '<?version=">=5.3.6"?>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_type(),
+			"Should have found bogus comment but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found #comment as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			'version="',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+
+		$processor->next_token();
+
+		$this->assertSame(
+			'=5.3.6"?>',
+			$processor->get_modifiable_text(),
+			'Should have found remaining syntax from abruptly-closed Processing Instruction.'
+		);
+	}
+
+	/**
+	 * Ensures that common comments are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @dataProvider data_common_comments
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 *
+	 * @param string $html Contains the comment in full.
+	 * @param string $text Contains the appropriate modifiable text.
+	 */
+	public function test_basic_assertion_common_comments( $html, $text ) {
+		$processor = new WP_HTML_Tag_Processor( $html );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_type(),
+			"Should have found comment but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found #comment as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			$text,
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Data provider.
+	 *
+	 * @return array[].
+	 */
+	public function data_common_comments() {
+		return array(
+			'Shortest comment'        => array( '<!-->', '' ),
+			'Short comment'           => array( '<!--->', '' ),
+			'Short comment w/o text'  => array( '<!---->', '' ),
+			'Short comment with text' => array( '<!----->', '-' ),
+			'PI node without target'  => array( '<? missing?>', ' missing?' ),
+			'Invalid PI node'         => array( '<?/missing/>', '/missing/' ),
+			'Invalid ! directive'     => array( '<!something else>', 'something else' ),
+		);
+	}
+
+	/**
+	 * Ensures that normative HTML comments are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_html_comment() {
+		$processor = new WP_HTML_Tag_Processor( '<!-- wp:paragraph -->' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_type(),
+			"Should have found comment but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'#comment',
+			$processor->get_token_name(),
+			"Should have found #comment as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			' wp:paragraph ',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative DOCTYPE elements are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_doctype() {
+		$processor = new WP_HTML_Tag_Processor( '<!DOCTYPE html>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#doctype',
+			$processor->get_token_type(),
+			"Should have found DOCTYPE but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'html',
+			$processor->get_token_name(),
+			"Should have found 'html' as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			' html',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative presumptuous tag closers (empty closers) are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_presumptuous_tag() {
+		$processor = new WP_HTML_Tag_Processor( '</>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#presumptuous-tag',
+			$processor->get_token_type(),
+			"Should have found presumptuous tag but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'#presumptuous-tag',
+			$processor->get_token_name(),
+			"Should have found #presumptuous-tag as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			'',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Ensures that normative funky comments are properly parsed.
+	 *
+	 * @ticket 60170
+	 *
+	 * @since 6.5.0
+	 *
+	 * @covers WP_HTML_Tag_Processor::next_token
+	 */
+	public function test_basic_assertion_funky_comment() {
+		$processor = new WP_HTML_Tag_Processor( '</%url>' );
+		$processor->next_token();
+
+		$this->assertSame(
+			'#funky-comment',
+			$processor->get_token_type(),
+			"Should have found funky comment but found {$processor->get_token_type()} instead."
+		);
+
+		$this->assertSame(
+			'#funky-comment',
+			$processor->get_token_name(),
+			"Should have found #funky-comment as name but found {$processor->get_token_name()} instead."
+		);
+
+		$this->assertNull(
+			$processor->get_tag(),
+			'Should not have been able to query tag name on non-element token.'
+		);
+
+		$this->assertNull(
+			$processor->get_attribute( 'type' ),
+			'Should not have been able to query attributes on non-element token.'
+		);
+
+		$this->assertSame(
+			'%url',
+			$processor->get_modifiable_text(),
+			'Found incorrect modifiable text.'
+		);
+	}
+
+	/**
+	 * Test helper that wraps a string in double quotes.
+	 *
+	 * @param string $s The string to wrap in double-quotes.
+	 * @return string The string wrapped in double-quotes.
+	 */
+	private static function quoted( $s ) {
+		return "\"$s\"";
+	}
+}
--- a/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php
+++ b/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php
@@ -557,8 +557,10 @@ class Tests_HtmlApi_WpHtmlTagProcessor extends WP_UnitTestCase {
 		$p = new WP_HTML_Tag_Processor( '<script>abc</script>' );

 		$p->next_tag();
-		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </script> tag closer' );
-		$this->assertTrue( $p->is_tag_closer(), 'Indicated a <script> tag opener is a tag closer' );
+		$this->assertFalse(
+			$p->next_tag( array( 'tag_closers' => 'visit' ) ),
+			'Should not have found closing SCRIPT tag when closing an opener.'
+		);

 		$p = new WP_HTML_Tag_Processor( 'abc</script>' );
 		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </script> tag closer when there was no tag opener' );
@@ -566,8 +568,10 @@ class Tests_HtmlApi_WpHtmlTagProcessor extends WP_UnitTestCase {
 		$p = new WP_HTML_Tag_Processor( '<textarea>abc</textarea>' );

 		$p->next_tag();
-		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </textarea> tag closer' );
-		$this->assertTrue( $p->is_tag_closer(), 'Indicated a <textarea> tag opener is a tag closer' );
+		$this->assertFalse(
+			$p->next_tag( array( 'tag_closers' => 'visit' ) ),
+			'Should not have found closing TEXTAREA when closing an opener.'
+		);

 		$p = new WP_HTML_Tag_Processor( 'abc</textarea>' );
 		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </textarea> tag closer when there was no tag opener' );
@@ -575,8 +579,10 @@ class Tests_HtmlApi_WpHtmlTagProcessor extends WP_UnitTestCase {
 		$p = new WP_HTML_Tag_Processor( '<title>abc</title>' );

 		$p->next_tag();
-		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </title> tag closer' );
-		$this->assertTrue( $p->is_tag_closer(), 'Indicated a <title> tag opener is a tag closer' );
+		$this->assertFalse(
+			$p->next_tag( array( 'tag_closers' => 'visit' ) ),
+			'Should not have found closing TITLE when closing an opener.'
+		);

 		$p = new WP_HTML_Tag_Processor( 'abc</title>' );
 		$this->assertTrue( $p->next_tag( array( 'tag_closers' => 'visit' ) ), 'Did not find the </title> tag closer when there was no tag opener' );
@@ -2357,6 +2363,7 @@ HTML;
 			'No tags'                => array( 'this is nothing more than a text node' ),
 			'Text with comments'     => array( 'One <!-- sneaky --> comment.' ),
 			'Empty tag closer'       => array( '</>' ),
+			'CDATA as HTML comment'  => array( '<![CDATA[this closes at the first &gt;]>' ),
 			'Processing instruction' => array( '<?xml version="1.0"?>' ),
 			'Combination XML-like'   => array( '<!DOCTYPE xml><?xml version=""?><!-- this is not a real document. --><![CDATA[it only serves as a test]]>' ),
 		);
@@ -2410,7 +2417,6 @@ HTML;
 			'Incomplete CDATA'                     => array( '<![CDATA[something inside of here needs to get out' ),
 			'Partial CDATA'                        => array( '<![CDA' ),
 			'Partially closed CDATA]'              => array( '<![CDATA[cannot escape]' ),
-			'Partially closed CDATA]>'             => array( '<![CDATA[cannot escape]>' ),
 			'Unclosed IFRAME'                      => array( '<iframe><div>' ),
 			'Unclosed NOEMBED'                     => array( '<noembed><div>' ),
 			'Unclosed NOFRAMES'                    => array( '<noframes><div>' ),
@@ -2507,7 +2513,7 @@ HTML;
 			),
 			'tag inside of CDATA'      => array(
 				'input'    => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span>test</span>',
-				'expected' => '<![CDATA[This <is> a <strong id="yes">HTML Tag</strong>]]><span class="firstTag" foo="bar">test</span>',
+				'expected' => '<![CDATA[This <is> a <strong class="firstTag" foo="bar" id="yes">HTML Tag</strong>]]><span class="secondTag">test</span>',
 			),
 		);
 	}