split token value and token kind #10399

bvanjoi · 2025-04-21T12:42:33Z

This constitutes a substantial code revision. We should first evaluate whether it yields measurable performance enhancements.

changeset-bot · 2025-04-21T12:42:38Z

⚠️ No Changeset found

Latest commit: f46cf93

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

codspeed-hq · 2025-04-21T12:46:53Z

CodSpeed Performance Report

Merging #10399 will degrade performances by 3.23%

_{Comparing bvanjoi:pref (f46cf93) with main (a29eb29)}

Summary

⚡ 12 improvements
❌ 2 regressions
✅ 138 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
❌	`serialization of serde`	2.7 µs	2.7 µs	-2.14%
⚡	`es/codegen/with-parser/large`	1.4 ms	1.4 ms	+2.5%
⚡	`es/parser/angular`	29.6 ms	28.3 ms	+4.44%
⚡	`es/parser/backbone`	4.6 ms	4.4 ms	+4.27%
⚡	`es/parser/cal-com`	90.4 ms	87.9 ms	+2.9%
⚡	`es/parser/colors`	122.1 µs	119.2 µs	+2.49%
⚡	`es/parser/jquery`	23.8 ms	22.7 ms	+4.57%
⚡	`es/parser/jquery mobile`	36.5 ms	35.1 ms	+4.01%
⚡	`es/parser/mootools`	18.9 ms	17.9 ms	+5.18%
⚡	`es/parser/three`	112.4 ms	107.3 ms	+4.71%
⚡	`es/parser/typescript`	635.5 ms	610.1 ms	+4.17%
⚡	`es/parser/underscore`	4 ms	3.8 ms	+4.7%
⚡	`es/parser/yui`	18.1 ms	17.3 ms	+4.32%
❌	`typescript/fast-strip`	660.9 µs	683 µs	-3.23%

bvanjoi · 2025-04-22T07:18:02Z

The performance gains demonstrate a compelling case for implementation.

Let me explain my idea:

First, split Token into TokenKind (the type of token, which is still referred to as Token in this PR) and TokenValue. This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing).
And then, these changes mean there are now two different types of Token between swc_ecma_lexer and swc_ecma_parser: one contains the TokenValue, while the other does not.
The points above mean that using a single Parser between swc_ecma_lexer and swc_ecma_parser is more complicated. We would need to create a trait to handle different types of Token and manage their behaviors, and the ROI for this seems low.
So, I made the following changes: I copied the swc_ecma_parser/parser into swc_ecma_lexer, meaning that swc_ecma_lexer now includes the Parser itself. I also created a swc_ecma_parser/lexer folder, which contains a lexer that serves only the swc_ecma_parser/Parser and is not exposed externally. With these changes, if you want to generate the correct tokens, you should use swc_ecma_lexer/{Lexer, Parser}, while the swc_ecma_parser/Parser is now only for quickly generating the AST.
Based on these changes, I’ve removed some tests that ensured the lexer worked correctly in swc_ecma_parser and switched to using swc_ecma_lexer in swc_fast_ts_strip.
Another change is simply replacing Token with TokenKind and TokenValue.

@kdy1 Do you agree with these changes? If so, I will fix CI and make it ready for review.

CPunisher · 2025-04-22T08:22:00Z

I have seen the two lexers and two parsers in swc_ecma_lexer and swc_ecma_parser. Do we need maintain all of them or one pair is WIP while another is released?

bvanjoi · 2025-04-22T10:07:23Z

Do we need maintain all of them

Yes, I think so. Perhaps we can find a better way to integrate it.

kdy1 · 2025-04-22T10:10:12Z

Why do we need two separate parser/lexer/token types?
What's the problem you see in the token definition in #10397 ?

bvanjoi · 2025-04-22T10:25:40Z

Why do we need two separate token types?

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

Why do we need two separate lexer ?

Since the lexical analysis implementations differ, swc_ecma_lexer/lexer guarantees token correctness while swc_ecma_parser/lexer does not maintain the same level of validation.

Why do we need two separate parser?

Using a single parser to handle different lexical logic introduces unnecessary development complexity. While I've separated them for now, I believe there should be a better way to consolidate this logic.

CPunisher · 2025-04-22T11:52:51Z

The value field defined in Token(source) is not required for parsing.

I'm not sure why TokenValue is not required for parsing. For example, doesn't an identifier token value required for constructing identifier ast node?

CPunisher · 2025-04-22T11:55:36Z

This means we can't retrieve the correct combination of TokenKind and TokenValue after parsing, because the TokenValue has already been consumed during parsing(edit: Unless the TokenValue is stored elsewhere, but this is unnecessary for parsing)

I guess this is the root cause of the splitting. But I'm not too clear about it. Could you please provide more detailed examples?

bvanjoi · 2025-04-22T12:05:54Z

I'm not sure why TokenValue is not required for parsing.

TokenValue is not required in the Token. For example:

a + b
^        --> `token: ident`, `lexer.state.token_value: a`
    ^    --> `token: ident`, `lexer.state.token_value: b`
    
// rather than
a + b
^        --> `token: token_kind: ident, token_value: a`
    ^    --> `token: token_kind: ident, token_value: b` // actually, we dont need to store `token_value` in each `token`

CPunisher · 2025-04-22T13:27:53Z

Since both of swc_ecma_parser and swc_ecma_lexer contain Lexer and Parser, under what circumstances should we use swc_ecma_parser? In other words, what's the difference between them now?

kdy1 · 2025-04-22T13:46:13Z

The value field defined in Token(source) is not required for parsing. Moving it to the outer struct would reduce memory usage.

Additionally, the had_line_break field in Token during parsing also unnecessary and can be removed in the future.

I understood, but I think the gain is too small if we have to maintain two tokens/lexers/parsers. My guess is that #10397 is the sweet point, but not sure.

bvanjoi · 2025-04-22T14:31:11Z

In other words, what's the difference between them now

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I think the gain is too small if we have to maintain two tokens/lexers/parsers

I suggest considering landing this change for three reasons:

Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type). I think with this change we can eventually achieve the same performance as oxc/parser.
Strict token validation is not commonly needed, and sacrificing performance improvements for it may not be a good trade-off.
We can reuse code by creating a shared crate (e.g., swc_ecma_lexer_common) to store common functions and traits.

CPunisher · 2025-04-22T14:46:43Z

The swc_ecma_parser/Parser implementation prioritizes AST generation performance at the cost of potential token validation gaps, whereas swc_ecma_lexer/Parser enforces strict token-level correctness guarantees during AST construction.

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser. If Strict token validation is not neccessary for generating correct AST, why we still keep swc_ecma_lexer/Parser in the project?

Using a separate lexer and parser between swc_ecma_parser and swc_ecma_lexer is the foundation for future optimizations (such as removing TokenContexts, which requires using a custom lexer; and remove the had_line_break field in Token, which need different Token type).

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed? If not, why can't do so?

bvanjoi · 2025-04-22T14:59:29Z

I'm not too clear about Strict token validation. Generating correct AST is always the foundation for parser.

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

why we still keep swc_ecma_lexer/Parser in the project

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

Do you mean copy the new Lexer and Parser, optimize on top of the new ones, and replace the original ones once they are completed

I'm having difficulty understanding the intended meaning here. Could you please clarify?

CPunisher · 2025-04-22T15:17:51Z

Consider the tokenization of < in <div/> when parsing a JSX file. Under strict parsing rules, this would be classified as a Token::JsxTagStart. However, in practice, treating it as a simple Token::LessThan is sufficient for generating a correct AST. The stricter approach requiring Token::JsxTagStart forces us to maintain unnecessary contextual information purely for AST construction purposes.

Learnt, and I agree this is a very good idea.

The primary purpose of swc_ecma_lexer/Parser is to ensure the lexer can function reliably, particularly for cases involving try_parse methods that may require multiple lexing attempts. Unlike a single-pass lexer, this implementation needs to handle scenarios where tokens cannot be fully determined in one pass.

(Alternatively, we might consider eventually removing swc_ecma_lexer/Parser entirely and simply exposing a lexical_analysis function that handles tokenization independently.)

I see. Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

I'm having difficulty understanding the intended meaning here. Could you please clarify?

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser (but maybe not yet finished)?

bvanjoi · 2025-04-22T15:37:54Z

Does this mean swc_ecma_parser/Parser MAY produce wrong results(without some preconditions satisfied)?

Certainly not - the current token and parser state combination is sufficient for producing a correct AST. This approach aligns with many established parser implementations, including TypeScript's.

Like swc_ecma_fast_parser, early it's a trial to rewrite the parser and then replace the swc_ecma_parser, but it seems to be aborted. My question is whether you are also writing a completely new parser

I agree with this perspective in principle, but I don't believe creating a separate crate would actually reduce the workload. My primary goal is to optimize swc_ecma_parser for faster AST generation while maintaining correctness.

To summarize these changes:

Main branch: swc_ecma_parser generates both AST and tokens.
After this PR:
- swc_ecma_lexer takes over token generation.
- swc_ecma_parser focuses solely on AST production, and it no longer exposes internal tokens.

kdy1 · 2025-04-22T15:40:41Z

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

CPunisher · 2025-04-22T15:46:12Z

Ok, I got the idea and I like it. Would it be possible to use a single struct with const generic or generic to make maintenance easy? I'm worried about the maintenance burden.

Agree. I expect finally we should eliminate all duplication and get a clear lexer and a clear parser (just like now, and faster), but we should still keep the code in the process as maintainable as possible, which also bring less burden for bug fixing.

kdy1 · 2025-04-22T15:52:13Z

If it's inevitable at the moment, I think we can merge this PR as is and refactor these with other PRs, but I think we should at least have concrete plan for future deduplication.

bvanjoi · 2025-04-22T15:56:53Z

I'm worried about the maintenance burden.

I am also worried about the maintenance cost: At present, it seems troublesome to abstract swc_ecma_lexer/Parser and swc_ecma_parser/Parser into an independent trait. My suggestion is:

Create a library swc_ecma_lexer_common, and then put the common parts of swc_ecma_lexer and swc_ecma_parser into it, and swc_ecma_parser will no longer depend on swc_ecma_lexer.
Gradually sort out the reusable logic of the two and implement it in swc_ecma_lexer_common.

What do you think?

kdy1 · 2025-04-22T16:02:00Z

It sounds acceptable for now, but I want to hear opinion from @CPunisher, too.

bvanjoi requested review from a team as code owners April 21, 2025 12:42

bvanjoi force-pushed the pref branch 5 times, most recently from 2c80e1a to c2b1b12 Compare April 22, 2025 03:39

bvanjoi force-pushed the pref branch from 8867c2d to f4bb503 Compare April 22, 2025 12:57

bvanjoi added 26 commits April 30, 2025 12:53

chore(es/lexer): common emit_error of lexer

f3e27dc

chore(es/lexer): common skip_line_comment

c1aaeea

chore(es/lexer): common skip_space

32355d8

chore(es/lexer): rm store_comment

673ee7f

chore(es/lexer): common number, part1

dfe9356

chore(es/lexer): common number, part2

f12f62b

chore(es/lexer): common consume_pending_comments

5feadcc

chore(es/lexer): common read_jsx_word

0e8619c

chore(es/lexer): common read_jsx_str

bf38bbc

chore(es/lexer): common read_token, part1

ef31789

chore(es/lexer): common read_token, part2

273ba98

chore(es/lexer): common read_token, part3

54e7bb4

chore(es/parser): common ExprExt

13afeb8

chore(es/parser): delete test in lexer/parser

41da3f0

chore(es/parser): common with_state

d484bf7

chore(es/parser): common buffer

a8abef2

chore(es/parser): common WithCtx

2a117df

chore(es/lexer): some traits

fdc497c

chore(es/parser): common emit_error

c88d167

chore(es/parser): common helpers

24445d4

chore(es/parser): common Verifier

a46dc6f

chore(es/parser): common verify_expr

b68bdf0

chore(es/parser): common parse_lit

ae5d009

chore(es/parser): common parse_ident_name

f8d098b

chore(es/parser): common make_decl_declare

ec9ce2a

chore(es/parser): common parse private name

1731817

bvanjoi force-pushed the pref branch 2 times, most recently from 767aa63 to be3e65f Compare April 30, 2025 09:10

chore(es/parser): common parse_ident

f46cf93

bvanjoi force-pushed the pref branch from be3e65f to f46cf93 Compare April 30, 2025 09:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

split token value and token kind #10399

split token value and token kind #10399

bvanjoi commented Apr 21, 2025

changeset-bot bot commented Apr 21, 2025 •

edited

Loading

codspeed-hq bot commented Apr 21, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025

bvanjoi commented Apr 22, 2025

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025

split token value and token kind #10399

Are you sure you want to change the base?

split token value and token kind #10399

Conversation

bvanjoi commented Apr 21, 2025

changeset-bot bot commented Apr 21, 2025 • edited Loading

⚠️ No Changeset found

codspeed-hq bot commented Apr 21, 2025 • edited Loading

CodSpeed Performance Report

Merging #10399 will degrade performances by 3.23%

Summary

Benchmarks breakdown

bvanjoi commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025 • edited Loading

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025 • edited Loading

bvanjoi commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025

bvanjoi commented Apr 22, 2025

CPunisher commented Apr 22, 2025 • edited Loading

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025 • edited Loading

CPunisher commented Apr 22, 2025 • edited Loading

kdy1 commented Apr 22, 2025

bvanjoi commented Apr 22, 2025

kdy1 commented Apr 22, 2025

changeset-bot bot commented Apr 21, 2025 •

edited

Loading

codspeed-hq bot commented Apr 21, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

bvanjoi commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading

kdy1 commented Apr 22, 2025 •

edited

Loading

CPunisher commented Apr 22, 2025 •

edited

Loading