Skip to content

fix: Parsing/pasting on prosemirror-model: 1.25.1 #1661

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 9, 2025

Conversation

matthewlipski
Copy link
Collaborator

@matthewlipski matthewlipski commented May 6, 2025

Issue

This PR addresses breaking changes to parsing introduced by prosemirror-model: 1.25.1. Prior to this update, the DOMParser would drop any nodes parsed that were not valid in the schema, based on the parent node. For example, take a look at the following HTML:

<ul>
  <li>
    <p>
      ...
    </p>
  </li>
</ul>

When parsing this before the update, the li element would get parsed as a bulletListItem node. Within it, the p element should also get parsed as a paragraph node. However, because the schema dictates that paragraph nodes can't be within bulletListItem nodes, the paragraph is dropped altogether and ignored.

After the update, the paragraph node will no longer get dropped, and the DOMParser will instead attempt to insert it somewhere so that it is valid in the schema. Since it can't be a child of the bulletListItem node, it instead gets wrapped in a blockContainer and blockGroup, which then gets inserted into the bulletListItem's parent blockContainer. In this scenario, we actually want to ignore the p tag and just parse its content.

Overall, the changes in prosemirror-model: 1.25.1 mean we have to be more diligent when writing parse rules. Paragraphs especially appear in many places, e.g. external HTML paragraphs, internal HTML paragraphs, internal HTML list items, and internal HTML table cells.

Parse rule changes

Internal HTML

After updating prosemirror-model, external HTML parse rules were being triggered when parsing blockContent nodes. This is because blockContent nodes can have all kinds of HTML inside, which we actually don't care about for parsing. To fix this, all default blockContent nodes have received the following change in their parse rules:

Before:

{
  tag: "div[data-content-type=" + this.name + "]",
},

After:

{
  tag: "div[data-content-type=" + this.name + "]",
  contentElement: ".bn-inline-content",
},

This change was made following a suggestion by Marijn here.

When parsing a blockContent element, this now tells the DOMParser to ignore all descendant elements except the one with the bn-inline-content class, and only parse its content.

Additionally, any bn-inline-content elements with the data-editable attribute have had this attribute removed.

External HTML

Because of the new parsing behaviour, we've had to add additional logic to list items and table cells.

HTML li elements may have multiple block or inline elements in them, which is incompatible with our schema as *ListItem blocks can only contain inline content. By default, the new parsing behaviour lifts all nodes that are incompatible with the schema up, so any e.g. p and h1 elements within a li are parsed as separate blocks as children of the *ListItem block. This has been modified to be more Notion-like, and you can find the logic for this explained in getListItemContent.ts.

The new default behaviour for table cells is the same as for list items, i.e. elements like p and h1 get lifted as children of the parent table block. Unlike list items though, moving content that isn't compatible with the schema to the children doesn't really make sense, so we would rather drop it altogether. This is basically how it already worked before the prosemirror-model update. However, the content of each element is now appended on the same line, whereas before, content from block-level elements would be appended to a new line (we may want to look into this again in the future).

Additionally, there's a minor fix for where media elements (embed, img, audio, and video) inside figure elements causing their respective blockContent nodes to be parsed twice.

Closes #1643
Closes #1645

Copy link

vercel bot commented May 6, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
blocknote ✅ Ready (Inspect) Visit Preview May 9, 2025 2:50pm
blocknote-website 🛑 Canceled (Inspect) May 9, 2025 2:50pm

Copy link

pkg-pr-new bot commented May 6, 2025

Open in StackBlitz

@blocknote/ariakit

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/ariakit@1661

@blocknote/code-block

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/code-block@1661

@blocknote/core

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/core@1661

@blocknote/mantine

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/mantine@1661

@blocknote/react

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/react@1661

@blocknote/server-util

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/server-util@1661

@blocknote/shadcn

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/shadcn@1661

@blocknote/xl-docx-exporter

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/xl-docx-exporter@1661

@blocknote/xl-multi-column

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/xl-multi-column@1661

@blocknote/xl-odt-exporter

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/xl-odt-exporter@1661

@blocknote/xl-pdf-exporter

npm i https://pkg.pr.new/TypeCellOS/BlockNote/@blocknote/xl-pdf-exporter@1661

commit: 044831d

Copy link
Contributor

@nperez0111 nperez0111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels wrong, could you give a write up on why this change is needed.

It feels like it litters the code everywhere, I may just be misunderstanding something here

@matthewlipski
Copy link
Collaborator Author

This feels wrong, could you give a write up on why this change is needed.

It feels like it litters the code everywhere, I may just be misunderstanding something here

Yep you were right, I went a bit overboard to make sure parse rules weren't being triggered accidentally and pretty much all of the getAttrs checks for element.closest("[data-content-type]") were redundant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants