supporting markdown links #514

Open
jvanasco opened this issue Oct 14, 2020 · 0 comments

@jvanasco jvanasco commented Oct 14, 2020

I maintain a package that uses html5lib to translate Markdown into HTML (https://github.com/jvanasco/html5lib_to_markdown) alongside our Bleach usage for dealing with user-submitted text.

I thought I had a workaround for some odd behavior between Python 2 and Python 3, but after encountering some issues migrating the CI tests to tox, I dug into both my library and this one, and I realized there was a bigger problem.

The problem is that while almost all of Markdown is valid HTML, it also supports a quick "link" format, which exists as a URL in an unnamed tag:

<https://example.com/path/to>

While my first reaction was to handle this in a pre-processor, I remembered that context matters and I need to know if I encounter this in a code-formatting block or not -- so I need to integrate this with a tokenizer.
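To illustrate why context matters, here is a minimal sketch of a naive pre-processor. The `AUTOLINK` pattern and the `naive_preprocess` helper are hypothetical names made up for this illustration; they are not part of html5lib or html5lib_to_markdown.

```python
import re

# Hypothetical pattern for Markdown quick links such as <https://example.com/path/to>.
AUTOLINK = re.compile(r"<((?:https?|mailto):[^\s<>]+)>")


def naive_preprocess(text):
    """Rewrite every quick link as an HTML anchor, with no awareness of code spans."""
    return AUTOLINK.sub(r'<a href="\1">\1</a>', text)


sample = "See <https://example.com/path/to> or type `<https://example.com/path/to>` literally."

# Both occurrences are rewritten, including the one inside the backtick code span,
# which should have been left alone.
print(naive_preprocess(sample))
```

Because the regex has no notion of where it is in the document, it cannot tell a real link from one that appears inside code formatting, which is why the tokenizer looks like the right place to handle this.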

When this library's tokenizer handles these links in emitCurrentToken, the current logic creates a token name of "http:", "https:", or "mailto:". This is great.

However, the token's raw data is cast into an ordered dict, which blows away any duplicate values along with any chance to recreate the tag, and some other characters trip up the delimiting. For example:

<https://example.com/a/aa/b/bb/c/d/e/f/g?foo=bar&bar=foo;#biz>
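The loss of duplicates can be shown with the standard library alone. This is a rough sketch: the `raw` pairs below are hand-written stand-ins for a path like /a/b/a, not captured html5lib output, and the exact pairs the tokenizer produces may differ.

```python
from collections import OrderedDict

# Hand-written (name, value) pairs standing in for the token's raw data.
# Illustrative assumption only; not real tokenizer output.
raw = [["example.com", ""], ["a", ""], ["b", ""], ["a", ""]]

# Casting to an ordered dict keeps one entry per key, so the second "a" is lost.
data = OrderedDict(raw)

print(len(raw), len(data))
print(list(data))
```

Once the pairs are collapsed this way, the original sequence of path segments can no longer be rebuilt from the token's data.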

Is there any chance of html5lib supporting a use case of keeping the full data of these unnamed URL tags somehow? I don't expect them to be serialized by this library, as this is a weird HTMLish format that is not real HTML, but Markdown is a popular and widespread format that is mostly valid HTML, except for this one _____ tag.

I have a few ideas that are about 70% of the way towards a PR for this, but if this use case is too far outside the scope of this library, I need to spend my time looking for alternatives.

Thanks, J
