Dennis Snell 616e673d3e HTML API: Scan all syntax tokens in a document, read modifiable text.
Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read `#text` nodes in the
document.

What's new in the Tag Processor?
================================

 - `next_token()` visits every distinct syntax token in a document.
 - `get_token_type()` indicates what kind of token it is.
 - `get_token_name()` returns something akin to `DOMNode.nodeName`.
 - `get_modifiable_text()` returns the text associated with a token.
 - `get_comment_type()` indicates why a token represents an HTML comment.

Example usage.
==============

{{{
<?php
function strip_all_tags( $html ) {
        $text_content = '';
        $processor    = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#text' !== $processor->get_token_type() ) {
                        continue;
                }

                $text_content .= $processor->get_modifiable_text();
        }

        return $text_content;
}
}}}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.

These are special tags because their content contains no other HTML or
markup, only non-HTML content.

 - SCRIPT elements contain raw text which is isolated from the rest of
   the HTML document and fed separately into a JavaScript engine. There
   are complicated rules to avoid escaping the script context in the HTML.
   The contents are left verbatim, and character references are not decoded.

 - TEXTAREA and TITLE elements contain plain text which is decoded
   before display, e.g. transforming `&amp;` into `&`. Any markup which
   resembles tags is treated as verbatim text and not a tag.

 - IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the
   textarea and title elements, but no character references are decoded.
   For example, `&amp;` inside a STYLE element is passed to the CSS engine
   as the literal string `&amp;` and _not_ as `&`.

Because it's important not to treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).

This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.

{{{
    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>
}}}
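
A minimal sketch of how the single-visit behavior can be observed, assuming the code runs inside WordPress with this patch applied so that `WP_HTML_Tag_Processor` is loaded. The helper name `describe_tokens()` is hypothetical and not part of the patch. It only calls the methods listed under "What's new in the Tag Processor?"

```php
<?php
/**
 * Hypothetical helper (not part of this patch): lists every token the
 * Tag Processor visits, with its type, name, and modifiable text.
 *
 * Assumes WordPress is loaded and provides the `next_token()` API.
 */
function describe_tokens( $html ) {
    $tokens    = array();
    $processor = new WP_HTML_Tag_Processor( $html );

    while ( $processor->next_token() ) {
        $tokens[] = array(
            'type' => $processor->get_token_type(),
            'name' => $processor->get_token_name(),
            'text' => $processor->get_modifiable_text(),
        );
    }

    return $tokens;
}

// Only run the demonstration when WordPress has loaded the class.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
    // A TITLE element and its closer are reported as a single token.
    $tokens = describe_tokens( '<title>There is only a single token in this line</title>' );
    printf( "%d token(s)\n", count( $tokens ) );
}
```

Collecting the tokens into an array keeps the sketch easy to inspect. Code that only needs one kind of token can test `get_token_type()` inside the loop instead.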

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:

 - a tag has a name, attributes, and a closing or self-closing flag.
 - a text node, or `#text` node contains plain text which is displayed
   in a browser and which is decoded before display.
 - a DOCTYPE declaration indicates how to parse the document.
 - a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way. Given the parsing rules for this invalid syntax,
they cannot be nested or recursive.

 - `</1>`
 - `</%avatar_url>`
 - `</{"wp_bit": {"type": "post-author"}}>`
 - `</[post-author]>`
 - `</__( 'Save Post' );>`

All of these examples become HTML comments in the browser. The content
inside the funky comment is easily parsable, whereby the only rule is
that it starts at the `<` and continues until the nearest `>`. There
can be no funky comment inside another, because that would imply having
a `>` inside of one, which would actually terminate the first one.
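
The rule above can be sketched in plain PHP, without WordPress. This is an illustration under stated assumptions, not the Tag Processor's implementation: `find_funky_comments()` is a hypothetical name, and the scan ignores context, so it would also report matches inside attribute values, regular comments, or SCRIPT content.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: collects the content of each
 * funky comment found in $html.
 *
 * Simplification: the raw string is scanned without tracking context.
 */
function find_funky_comments( string $html ): array {
    $found  = array();
    $offset = 0;
    $length = strlen( $html );

    while ( false !== ( $start = strpos( $html, '</', $offset ) ) ) {
        $content_at = $start + 2;
        if ( $content_at >= $length ) {
            break; // The document ends right after `</`.
        }

        $first = $html[ $content_at ];

        // A letter starts a real closing tag, and `</>` is not a funky comment.
        if ( '>' === $first || 1 === preg_match( '/^[a-zA-Z]$/', $first ) ) {
            $offset = $content_at;
            continue;
        }

        // The funky comment runs until the nearest `>`.
        $end = strpos( $html, '>', $content_at );
        if ( false === $end ) {
            break; // Unterminated, so there is nothing more to find.
        }

        $found[] = substr( $html, $content_at, $end - $content_at );
        $offset  = $end + 1;
    }

    return $found;
}

print_r( find_funky_comments( '<p>By </%avatar_url> </[post-author]></p>' ) );
```

Because the first `>` always ends the comment, no lookahead or nesting counter is needed, which is the property the paragraph above describes.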

What is modifiable text?
========================

Modifiable text is similar to the `innerText` property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.

{{{
    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag
}}}

{{{
    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘
}}}

Text nodes
==========

Text nodes are entirely modifiable text.

{{{
    This HTML document has no tags.
    └─ modifiable text ───────────┘
}}}

Comments
========

The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This applies to a number of invalid comments.

{{{
    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘
}}}

{{{
    <!-->
    This invalid comment has no modifiable text.
}}}

{{{
    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘
}}}

{{{
    <![CDATA[this is an invalid comment]]>
    │        └─ modifiable text ───────┘ │
    └─ comment token ────────────────────┘
}}}
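
A short sketch of reading comments with the new methods, assuming WordPress is loaded with this patch applied. `describe_comments()` is a hypothetical helper name, and the `'#comment'` token type string is an assumption modeled on `DOMNode.nodeName`; it does not appear in this message.

```php
<?php
/**
 * Hypothetical helper (not part of this patch): for every comment token,
 * reports why it is a comment and what its modifiable text is.
 *
 * Assumes comment tokens report the type '#comment'.
 */
function describe_comments( $html ) {
    $comments  = array();
    $processor = new WP_HTML_Tag_Processor( $html );

    while ( $processor->next_token() ) {
        if ( '#comment' !== $processor->get_token_type() ) {
            continue;
        }

        $comments[] = array(
            'why'  => $processor->get_comment_type(),
            'text' => $processor->get_modifiable_text(),
        );
    }

    return $comments;
}

// Only run the demonstration when WordPress has loaded the class.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
    print_r( describe_comments( '<!-- this is inside a comment --><!-->' ) );
}
```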

Other token types also have modifiable text. Consult the code or tests
for further information.

Developed in https://github.com/WordPress/wordpress-develop/pull/5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows [57575]

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes #60170



git-svn-id: https://develop.svn.wordpress.org/trunk@57348 602fd350-edb4-49c9-b593-d223f7449a82
2024-01-24 23:35:46 +00:00

WordPress

Welcome to the WordPress development repository! Please check out the contributor handbook for information about how to open bug reports, contribute patches, test changes, write documentation, or get involved in any way you can.

Getting Started

Using GitHub Codespaces

To get started, create a codespace for this repository by clicking this 👇

Open in GitHub Codespaces

A codespace will open in a web-based version of Visual Studio Code. The dev container is fully configured with the software needed for this project.

Note: Dev containers is an open spec which is supported by GitHub Codespaces and other tools.

In some browsers the keyboard shortcut for opening the command palette (Ctrl/Command + Shift + P) may collide with a browser shortcut. The command palette can be opened via the F1 key or via the cog icon in the bottom left of the editor.

When opening your codespace, be sure to wait for the postCreateCommand to finish running to ensure your WordPress install is successfully set up. This can take a few minutes.

Local development

WordPress is a PHP, MySQL, and JavaScript based project, and uses Node for its JavaScript dependencies. A local development environment is available to quickly get up and running.

You will need a basic understanding of how to use the command line on your computer. This will allow you to set up the local development environment, to start it and stop it when necessary, and to run the tests.

You will need Node and npm installed on your computer. Node is a JavaScript runtime used for developer tooling, and npm is the package manager included with Node. If you have a package manager installed for your operating system, setup can be as straightforward as:

  • macOS: brew install node
  • Windows: choco install nodejs
  • Ubuntu: apt install nodejs npm

If you are not using a package manager, see the Node.js download page for installers and binaries.

Note: WordPress currently only officially supports Node.js 20.x and npm 10.x.

You will also need Docker installed and running on your computer. Docker is the virtualization software that powers the local development environment. Docker can be installed just like any other regular application.

Development Environment Commands

Ensure Docker is running before using these commands.

To start the development environment for the first time

Clone the current repository using git clone https://github.com/WordPress/wordpress-develop.git. Then, in your terminal, move to the repository folder with cd wordpress-develop and run the following commands:

npm install
npm run build:dev
npm run env:start
npm run env:install

Your WordPress site will be accessible at http://localhost:8889. You can see or change configurations in the .env file located at the root of the project directory.

To watch for changes

If you're making changes to WordPress core files, you should start the file watcher in order to build or copy the files as necessary:

npm run dev

To stop the watcher, press ctrl+c.

To run a WP-CLI command

npm run env:cli -- <command>

WP-CLI has many useful commands you can use to work on your WordPress site. Where the documentation mentions running wp, run npm run env:cli -- instead. For example:

npm run env:cli -- help

To run the tests

These commands run the PHP and end-to-end test suites, respectively:

npm run test:php
npm run test:e2e

You can pass extra parameters into the PHP tests by adding -- and then the command-line options:

npm run test:php -- --filter <test name>
npm run test:php -- --group <group name or ticket number>

To restart the development environment

You may want to restart the environment if you've made changes to the configuration in the docker-compose.yml or .env files. Restart the environment with:

npm run env:restart

To stop the development environment

You can stop the environment when you're not using it to preserve your computer's power and resources:

npm run env:stop

To start the development environment again

Starting the environment again is a single command:

npm run env:start

Credentials

These are the default environment credentials:

  • Database Name: wordpress_develop
  • Username: root
  • Password: password

To log in to the site, navigate to http://localhost:8889/wp-admin.

  • Username: admin
  • Password: password

Note: With Codespaces, open the port-forwarded URL from the ports tab in the terminal, and append /wp-admin to log in to the site.

To generate a new password (recommended):

  1. Go to the Dashboard
  2. Click the Users menu on the left
  3. Click the Edit link below the admin user
  4. Scroll down and click 'Generate password'. Either use this password (recommended) or change it, then click 'Update User'. If you use the generated password be sure to save it somewhere (password manager, etc).