Skip to content

split token value and token kind #10399

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 48 commits into
base: main
Choose a base branch
from
Draft

split token value and token kind #10399

wants to merge 48 commits into from

Conversation

bvanjoi
Copy link
Contributor

@bvanjoi bvanjoi commented Apr 21, 2025

This constitutes a substantial code revision. We should first evaluate whether it yields measurable performance enhancements.

@bvanjoi bvanjoi requested review from a team as code owners April 21, 2025 12:42
Copy link

changeset-bot bot commented Apr 21, 2025

⚠️ No Changeset found

Latest commit: f46cf93

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link

codspeed-hq bot commented Apr 21, 2025

CodSpeed Performance Report

Merging #10399 will degrade performances by 3.23%

Comparing bvanjoi:pref (f46cf93) with main (a29eb29)

Summary

⚡ 12 improvements
❌ 2 regressions
✅ 138 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
serialization of serde 2.7 µs 2.7 µs -2.14%
es/codegen/with-parser/large 1.4 ms 1.4 ms +2.5%
es/parser/angular 29.6 ms 28.3 ms +4.44%
es/parser/backbone 4.6 ms 4.4 ms +4.27%
es/parser/cal-com 90.4 ms 87.9 ms +2.9%
es/parser/colors 122.1 µs 119.2 µs +2.49%
es/parser/jquery 23.8 ms 22.7 ms +4.57%
es/parser/jquery mobile 36.5 ms 35.1 ms +4.01%
es/parser/mootools 18.9 ms 17.9 ms +5.18%
es/parser/three 112.4 ms 107.3 ms +4.71%
es/parser/typescript 635.5 ms 610.1 ms +4.17%
es/parser/underscore 4 ms 3.8 ms +4.7%
es/parser/yui 18.1 ms 17.3 ms +4.32%
typescript/fast-strip 660.9 µs 683 µs -3.23%

@bvanjoi bvanjoi force-pushed the pref branch 5 times, most recently from 2c80e1a to c2b1b12 Compare April 22, 2025 03:39
@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

The performance gains demonstrate a compelling case for implementation.

Let me explain my idea:

  • First, split Token into TokenKind (the type of token, which is still referred to as Token in this PR) and TokenValue. This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing).
  • And then, these changes mean there are now two different types of Token between swc_ecma_lexer and swc_ecma_parser: one contains the TokenValue, while the other does not.
  • The points above mean that using a single Parser between swc_ecma_lexer and swc_ecma_parser is more complicated. We would need to create a trait to handle different types of Token and manage their behaviors, and the ROI for this seems low.
  • So, I made the following changes: I copied the swc_ecma_parser/parser into swc_ecma_lexer, meaning that swc_ecma_lexer now includes the Parser itself. I also created a swc_ecma_parser/lexer folder, which contains a lexer that serves only the swc_ecma_parser/Parser and is not exposed externally. With these changes, if you want to generate the correct tokens, you should use swc_ecma_lexer/{Lexer, Parser}, while the swc_ecma_parser/Parser is now only for quickly generating the AST.
  • Based on these changes, I’ve removed some tests that ensured the lexer worked correctly in swc_ecma_parser and switched to using swc_ecma_lexer in swc_fast_ts_strip.
  • Another change is simply replacing Token with TokenKind and TokenValue.

@kdy1 Do you agree with these changes? If so, I will fix CI and make it ready for review.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

I have seen the two lexers and two parsers in swc_ecma_lexer and swc_ecma_parser. Do we need maintain all of them or one pair is WIP while another is released?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Do we need maintain all of them

Yes, I think so. Perhaps we can find a better way to integrate it.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

Why do we need two separate parser/lexer/token types?
What's the problem you see in the token definition in #10397 ?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Why do we need two separate token types?

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

Why do we need two separate lexer ?

Since the lexical analysis implementations differ, swc_ecma_lexer/lexer guarantees token correctness while swc_ecma_parser/lexer does not maintain the same level of validation.

Why do we need two separate parser?

Using a single parser to handle different lexical logic introduces unnecessary development complexity. While I've separated them for now, I believe there should be a better way to consolidate this logic.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

The value field defined in Token(source) is not required for parsing.

I'm not sure why TokenValue is not required for parsing. For example, doesn't an identifier token value required for constructing identifier ast node?

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

  • This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing)

I guess this is the root cause of the splitting. But I'm not too clear about it. Could you please provide more detailed examples?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm not sure why TokenValue is not required for parsing.

TokenValue is not required in the Token. For example:

a + b
^        --> `token: ident`, `lexer.state.token_value: a`
    ^    --> `token: ident`, `lexer.state.token_value: b`
    
// rather than
a + b
^        --> `token: token_kind: ident, token_value: a`
    ^    --> `token: token_kind: ident, token_value: b` // actually, we dont need to store `token_value` in each `token`

@CPunisher
Copy link
Member

Since both of swc_ecma_parser and swc_ecma_lexer contain Lexer and Parser, under what circumstances should we use swc_ecma_parser? In other words, what's the difference between them now?

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

I understood, but I think the gain is too small if we have to maintain two tokens/lexers/parsers. My guess is that #10397 is the sweet point, but not sure.

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

In other words, what's the difference between them now

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I think the gain is too small if we have to maintain two tokens/lexers/parsers

I suggest considering landing this change for three reasons:

  • ​Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type). I think with this change we can eventually achieve the same performance as oxc/parser.
  • Strict token validation is not commonly needed, and sacrificing performance improvements for it may not be a good trade-off.
  • ​We can reuse code by creating a shared crate (e.g., swc_ecma_lexer_common) to store common functions and traits.

@CPunisher
Copy link
Member

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser. If Strict token validation is not neccessary for generating correct AST, why we still keep swc_ecma_lexer/Parser in the project?

  • ​Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type).

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed? If not, why can't do so?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser.

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

why we still keep swc_ecma_lexer/Parser in the project

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed

I'm having difficulty understanding the intended meaning here. Could you please clarify?

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

Learnt, and I agree this is a very good idea.

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

I see. Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

I'm having difficulty understanding the intended meaning here. Could you please clarify?

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser (but maybe not yet finished)?

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

Certainly not - the current token and parser state combination is sufficient for producing a correct AST. This approach aligns with many established parser implementations, including TypeScript's.

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser

I agree with this perspective in principle, but I don't believe creating a separate crate would actually reduce the workload. My primary goal is to optimize swc_ecma_parser for faster AST generation while maintaining correctness.

To summarize these changes:

  • Main branch: swc_ecma_parser generates both AST and tokens.
  • After this PR:
    • swc_ecma_lexer takes over token generation.
    • swc_ecma_parser focuses solely on AST production, and it no longer exposes internal tokens.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

@CPunisher
Copy link
Member

CPunisher commented Apr 22, 2025

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

Agree. I expect finally we should eliminate all duplication and get a clear lexer and a clear parser (just like now, and faster), but we should still keep the code in the process as maintainable as possible, which also bring less burden for bug fixing.

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

If it's inevitable at the moment, I think we can merge this PR as is and refactor these with other PRs, but I think we should at least have concrete plan for future deduplication.

@bvanjoi
Copy link
Contributor Author

bvanjoi commented Apr 22, 2025

I'm worried about the maintenance burden.

I am also worried about the maintenance cost: At present, it seems troublesome to abstract swc_ecma_lexer/Parser and swc_ecma_parser/Parser into an independent trait. My suggestion is:

  • Create a library swc_ecma_lexer_common, and then put the common parts of swc_ecma_lexer and swc_ecma_parser into it, and swc_ecma_parser will no longer depend on swc_ecma_lexer.
  • Gradually sort out the reusable logic of the two and implement it in swc_ecma_lexer_common.

What do you think?

@kdy1
Copy link
Member

kdy1 commented Apr 22, 2025

It sounds acceptable for now, but I want to hear opinion from @CPunisher, too.

@bvanjoi bvanjoi force-pushed the pref branch 2 times, most recently from 767aa63 to be3e65f Compare April 30, 2025 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants