Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HtmlSanitizer] Introduce HtmlSanitizer component #44681

Open
wants to merge 1 commit into
base: 6.1
Choose a base branch
from

Conversation

@tgalopin
Copy link
Member

@tgalopin tgalopin commented Dec 16, 2021

Q A
Branch? 6.1
Bug fix? no
New feature? yes
Deprecations? no
Tickets Fix #44144
License MIT
Doc PR -

Following discussion on #44144, this PR proposes to add a new HtmlSanitizer component to Symfony.

I refactored the infrastructure of my original html-sanitizer to better follow the W3C Standard Proposal about sanitizers.

The sanitizer can be used as follow:

use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;
use Symfony\Component\HtmlSanitizer\HtmlSanitizer;

// Create configuration based on the W3C Sanitizer API standard default
// configuration. This configuration is tolerant and may allow features
// (like CSS injection) you would like to disable.
$config = HtmlSanitizerConfig::createStandardConfig();

// For instance:
$config = HtmlSanitizerConfig::createStandardConfig()
    ->dropElement('style')
    ->dropAttribute('style', '*')
;

// or

// By default, an element not added to the allowed or blocked elements
// will be dropped, including its children
$config = (new HtmlSanitizerConfig())
    // Allow the "div" element and no attribute can be on it
    ->allowElement('div')

    // Allow the "a" element, and the "title" attribute to be on it
    ->allowElement('a', ['title'])

    // Allow the "span" element, and any attribute from the Sanitizer API is allowed
    // (see https://wicg.github.io/sanitizer-api/#default-configuration)
    ->allowElement('span', '*')

    // Block the "section" element: this element will be removed but
    // its children will be retained
    ->blockElement('section')

    // Drop the "div" element: this element will be removed, including its children
    ->dropElement('div')

    // Allow the attribute "title" on the "div" element
    ->allowAttribute('title', ['div'])

    // Allow the attribute "data-custom-attr" on all currently allowed elements
    ->allowAttribute('data-custom-attr', '*')

    // Drop the "data-custom-attr" attribute from the "div" element:
    // this attribute will be removed
    ->dropAttribute('data-custom-attr', ['div'])

    // Drop the "data-custom-attr" attribute from all elements:
    // this attribute will be removed
    ->dropAttribute('data-custom-attr', '*')

    // Transform all HTTP schemes to HTTPS
    ->forceHttpsUrls()

    // Configure which schemes are allowed in links (others will be dropped)
    ->withAllowedLinksSchemes(['https', 'http', 'mailto'])

    // Configure which hosts are allowed in links (by default all are allowed)
    ->withAllowedLinksHosts(['symfony.com', 'example.com'])

    // Allow relative URL in links (by default they are dropped)
    ->withAllowedLinksRelative()

    // Configure which schemes are allowed in img/audio/video/iframe (others will be dropped)
    ->withAllowedMediaSchemes(['https', 'http'])

    // Configure which hosts are allowed in img/audio/video/iframe (by default all are allowed)
    ->withAllowedMediaHosts(['symfony.com', 'example.com'])

    // Allow relative URL in img/audio/video/iframe (by default they are dropped)
    ->withAllowedMediaRelative()
;

$sanitizer = new HtmlSanitizer($config);

// Sanitize a given string, using the configuration provided, and in the
// "body" context (tags only allowed in <head> will be removed)
$sanitizer->sanitize($userInput);

// Sanitize the given string for a usage in a <head> tag
$sanitizer->sanitizeFor('head', $userInput);

// Sanitize the given string for a usage in another tag
$sanitizer->sanitizeFor('title', $userInput); // Will encode as HTML entities
$sanitizer->sanitizeFor('textarea', $userInput); // Will encode as HTML entities
$sanitizer->sanitizeFor('div', $userInput); // Will sanitize as body
$sanitizer->sanitizeFor('section', $userInput); // Will sanitize as body
// ...

The component is working and well tested.

I did some very basic performance testing and it seems to be **~ 2 times faster** than my original library, mostly by relying on much less objects and more on native primitives. My original library was already as fast as the fastest PHP HTML sanitizers, meaning this component may well be the fastest sanitizer in the PHP ecosystem (to better check of course!). That's exciting :) !

I still have on my TODO list:

  • Review and update the annotations in all classes and methods (especially to mark the component as experimental) ;
  • Add the FameworkBundle integration to use it in Symfony apps

Feel free to review the component while I work on finalizing it and the FameworkBundle integration!

@xorik
Copy link

@xorik xorik commented Dec 16, 2021

Hi.

First of all, thank you for you effort, this is really cool idea!

Second: I'm not a Symony team member, so feel free to ignore me.

My concern is that configuring will create a bunch of temporary objects. Maybe it's better to make a separate builder class, something like:

<?php

$config = (new HtmlSanitizerConfigBuilder())
    ->allowElement('div')
    ->allowElement('a', ['title'])
    ->getConfig()
;
$sanitizer = new HtmlSanitizer($config);

And call constructor only in the getConfig method, and not on every step, so it will be less CPU intensive. But maybe it's not required, if this will be somehow optimized in DI to a single constructor call.

Loading

{
$this->config = clone $config;
$this->maxInputLength = $maxInputLength;
$this->parser = $parser ?: new MastermindsParser();
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$this->parser = $parser ?: new MastermindsParser();
$this->parser = $parser ?? new MastermindsParser();

Loading

if (!isset($this->domVisitors[$context])) {
$this->domVisitors[$context] = $this->createDomVisitorForContext($context);
}
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!isset($this->domVisitors[$context])) {
$this->domVisitors[$context] = $this->createDomVisitorForContext($context);
}
$this->domVisitors[$context] ??= $this->createDomVisitorForContext($context);

Loading


try {
$parsed = $this->parser->parse($input);
} catch (\Exception) {
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific about what kind of exception we expect here?

Loading

if (isset($self->blockedElements[$element])) {
unset($self->blockedElements[$element]);
}
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (isset($self->blockedElements[$element])) {
unset($self->blockedElements[$element]);
}
unset($self->blockedElements[$element]);

unset() should be idempotent. No need to safe-guard the call.

Loading

*/
class HtmlSanitizerConfig
{
// Elements that should be removed but their children should be retained
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use PHPDoc syntax to document properties.

Loading

class HtmlSanitizerConfig
{
// Elements that should be removed but their children should be retained
private array $blockedElements = [];
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we be more specific on the kind of elements that the array properties in this class may contain?

Loading


public function __construct(ParserInterface $parser, \Throwable $previous = null)
{
parent::__construct('HTML parsing failed using parser '.\get_class($parser), 0, $previous);
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parent::__construct('HTML parsing failed using parser '.\get_class($parser), 0, $previous);
parent::__construct('HTML parsing failed using parser '.\get_debug_type($parser), 0, $previous);

Loading

"league/uri": "^6.5",
"masterminds/html5": "^2.4",
"psr/log": "^1.0|^2.0|^3.0",
"symfony/string": "^5.0|^6.0"
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"symfony/string": "^5.0|^6.0"
"symfony/string": "^5.4|^6.0"

We've bumped all Symfony dependencies to ^5.4|^6.0 on the 6.0 branch. We should do the same for new components.

Loading

private NodeInterface $parentNode;
private string $text;

public function __construct(NodeInterface $parentNode, string $text)
{
$this->parentNode = $parentNode;
$this->text = $text;
}
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private NodeInterface $parentNode;
private string $text;
public function __construct(NodeInterface $parentNode, string $text)
{
$this->parentNode = $parentNode;
$this->text = $text;
}
public function __construct(
private NodeInterface $parentNode,
private string $text
) {
}

Loading

try {
$parsed = parse($url);
} catch (Exception) {
return null;
}

return $parsed;
Copy link
Member

@derrabus derrabus Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try {
$parsed = parse($url);
} catch (Exception) {
return null;
}
return $parsed;
try {
return parse($url);
} catch (Exception) {
return null;
}

Loading

*
* @final
*/
class MastermindsParser implements ParserInterface
Copy link
Member

@GromNaN GromNaN Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this class is not actually final?

Loading


private function visitNode(\DOMNode $domNode, Cursor $cursor)
{
$nodeName = u($domNode->nodeName)->lower()->toString();
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need to use symfony/string to lowercase the node name ? Node names are ASCII only AFAIK.

Loading

Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this component is only used once, I would drop this dep 👍🏻

Loading


interface AttributeSanitizerInterface
{
public function getSupportedElements(): array;
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation for those interfaces:

  • method description explaining what the method is about
  • phpdoc describing the type of arrays

Loading


use function League\Uri\build;
use League\Uri\Exception;
use function League\Uri\parse;
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we start adding use statements for functions, I suggest sorting the use statements by type

Loading

Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a job for php-ca-fixer

Loading

@@ -0,0 +1,91 @@
Example usage:
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The readme should contain the info present in all component readme files.

Loading

// Should the URL in the sanitized document be transformed to HTTPS if they are using HTTP
private bool $forceHttpsUrls = false;

// Sanitizers that should be applied to specific attributes in addition to standard sanitation
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here (sanitation is not a valid word).

Loading

// Should the URL in the sanitized document be transformed to HTTPS if they are using HTTP
private bool $forceHttpsUrls = false;

// Sanitizers that should be applied to specific attributes in addition to standard sanitation
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property description should be in the doc comment of the property rather than in a separate comment

Loading

$this->parser = $parser;
}

public function getParser(): ParserInterface
Copy link
Member

@stof stof Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exposing the parser in the exception ?

Loading

6.1.0
-----

* added the component as experimental
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* added the component as experimental
* Add the component as experimental

Loading

*/
class MastermindsParser implements ParserInterface
{
public function parse(string $html): \DOMNode
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inheritdoc? Or is this not needed?

Loading

use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;
use Symfony\Component\HtmlSanitizer\Visitor\AttributeSanitizer\AttributeSanitizerInterface;

class HtmlSanitizerCustomTest extends TestCase
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final please

Loading

use PHPUnit\Framework\TestCase;
use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;

class HtmlSanitizerConfigTest extends TestCase
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final please

Loading

use Symfony\Component\HtmlSanitizer\HtmlSanitizer;
use Symfony\Component\HtmlSanitizer\HtmlSanitizerConfig;

class HtmlSanitizerStandardTest extends TestCase
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final please

Loading


private function visitNode(\DOMNode $domNode, Cursor $cursor)
{
$nodeName = u($domNode->nodeName)->lower()->toString();
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this component is only used once, I would drop this dep 👍🏻

Loading

*
* @final
*/
class Cursor
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why no real final?

Loading

use function Symfony\Component\String\u;

/**
* The DomVisitor iterate over the parsed DOM tree and visit nodes using the NodeVisitor.
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The DomVisitor iterate over the parsed DOM tree and visit nodes using the NodeVisitor.
* The DomVisitor iterates over the parsed DOM tree and visits nodes using the NodeVisitor.

?

Loading

*/
interface NodeInterface
{
public function addChild(self $node): void;
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe the methods

Loading

</php>

<testsuites>
<testsuite name="Symfony String Component Test Suite">
Copy link
Contributor

@OskarStark OskarStark Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong name

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

7 participants