Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(markdown): improve markdown parser and diagnostics #5292

Open
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

afonsojramos
Copy link
Contributor

Summary

I've made some needed improvements to the Markdown parser's structure and error reporting. The main issues I tackled were:

  1. Reorganized the parser code with separate modules for each block type
  2. Fixed diagnostic messages for headers without spaces after hash symbols
  3. Made error handling more consistent
  4. Added a README with usage examples

Before these changes, the parser couldn't properly report when someone wrote "#Header" instead of "# Header". Now it catches these errors with the right position info and a clear message. The new structure should also make it easier for others to work on the parser going forward.

Test Plan

The implementation correctness is demonstrated by:

✅ Tests passing
✅ Invalid header reporting diagnostics
✅ Some manual testing

To verify:

  • Run cargo test -p biome_markdown_parser to confirm all tests pass
  • Check that the invalid_header.md test properly emits the diagnostic "Invalid header format: missing space after '#'"
  • Try parsing markdown with invalid headers and verify the error message and location are correct

…ens and kinds

- Rename `DemoSyntaxTreeBuilder` to `MarkdownSyntaxTreeBuilder` in markdown factory
- Extend `MarkdownSyntaxKind` with new tokens for tables, lists, blockquotes, and delimiters
- Add utility functions in `make.rs` to create various Markdown syntax tokens
- Update macro rules to include new token types
- Introduce `markdown` module in configuration crate
- Add `MarkdownConfiguration` struct with formatter, linter, and parser options
- Extend `Configuration` struct to include optional Markdown configuration
- Implement default configuration and configuration builder for Markdown
- Update `LintCommandPayload` to include Markdown linter configuration
- Modify CLI session to handle Markdown linter in command processing
- Add Markdown linter configuration import and merge logic in command runner
@github-actions github-actions bot added A-CLI Area: CLI A-Project Area: project A-Parser Area: parser labels Mar 7, 2025
…kdown block types

- Enhance parsing logic for blockquotes, headers, list items, and thematic breaks
- Add more robust whitespace and marker handling
- Improve continuation line parsing for list items
- Implement more precise checkpoint and rewind mechanisms
- Add better error recovery and validation for different block types
Copy link

codspeed-hq bot commented Mar 7, 2025

CodSpeed Performance Report

Merging #5292 will not alter performance

Comparing afonsojramos:feat/markdown-parser (0222689) with main (08de81d)

Summary

✅ 95 untouched benchmarks

Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for contribution! I haven't reviewed the logic of the parser because there are other important things to change. Briefly:

  • revert snapshots that caused regressions
  • remove files that cause exptions
  • do not expose any configuration to our users yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit conflicted about accepting the parsing of headers because there was another contributor who was working on it before you: #5208

I wish you could have engaged with us before jumping straight to developments. Let's see how it goes, since this PR needs some reviews.

Comment on lines 144 to 149
// For now, update snapshots automatically
insta::with_settings!({
prepend_module_to_snapshot => false,
snapshot_path => &test_directory,
// Auto-accept new snapshots during development
input_file => test_case_path.to_string_lossy().to_string(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, let's avoid this. It's easier to accept a snapshot with another CLI command than figuring out why a snapshot was updated, because we might forget to update this code.

@dyc3 dyc3 self-requested a review March 7, 2025 19:03
…figuration imports

- Delete markdown parser test files including `parser_test.rs` and `spec_test.rs`
- Remove unused Markdown configuration import from CLI commands module
- Update spec test to use more robust debugging and tree structure printing
- Remove test files for invalid markdown headers
@github-actions github-actions bot removed the A-Project Area: project label Mar 12, 2025
…t validation

- Add direct support for thematic break literal tokens
- Enhance thematic break block parsing to handle whitespace and marker variations
- Implement stricter validation to prevent invalid thematic break content
- Add comments to clarify thematic break detection logic
…ine handling

- Refactor blockquote parsing to handle nested blockquotes and continuation lines
- Improve line parsing logic to support more complex blockquote structures
- Simplify content node creation and add more robust parsing mechanism
- Enhance end-of-blockquote detection with better line and EOF handling
…ed logic

- Simplify header block parsing by removing redundant diagnostic creation
- Enhance whitespace handling with a more concise consumption method
- Remove explicit text node creation for header content
- Improve header validation and parsing robustness
…rkers and thematic breaks

- Enhance lexer to handle list markers for stars, minus, digits, and underscores
- Implement more robust thematic break detection with flexible marker parsing
- Separate parsing logic for different marker types to improve code clarity
- Add support for various list marker formats and whitespace handling
- Simplify document parsing by removing unnecessary checks and improving block handling
- Introduce a checkpoint mechanism for better fallback to paragraph parsing
- Streamline block element recognition and improve error handling for unrecognized tokens
- Add support for '+' as a list marker and enhance overall lexer functionality
- Update tests to allow for development diagnostics and improve tree structure printing
… function names and logic

- Rename functions for clarity: `at_header_block` to `at_atx_header` and `parse_header_block` to `parse_atx_header`
- Improve header validation by ensuring whitespace follows hash symbols
- Streamline content parsing logic for ATX headers, including handling of trailing hashes
- Add a new function `at_setext_header` for future Setext header support
…alidation and diagnostics

- Introduce detailed logging for whitespace and marker checks in unordered and ordered list parsing
- Refactor list item parsing functions to include parent indentation tracking
- Improve whitespace handling after list markers and ensure valid content parsing
- Streamline continuation line handling for list items and enhance overall structure
- Consolidate paragraph parsing by removing redundant checks and simplifying inline element handling
- Introduce a checkpoint mechanism to improve content validation and error handling
- Enhance whitespace management and ensure proper handling of leading spaces and newlines
- Create a more efficient structure for paragraph item list creation and completion
@github-actions github-actions bot added the A-Tooling Area: internal tools label Mar 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-CLI Area: CLI A-Parser Area: parser A-Tooling Area: internal tools
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants