Opened on Mar 24, 2022 · edited by chrisjsewell

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Problem

Heya, I would like to directly import/use the compiler, on a pre-created list of events.
I think this is currently not possible?

Obviously this is the key function provided by this package; fromMarkdown is just a wrapper around it and the upstream preprocess/parse/postprocess functions (all importable).

Solution

Allow for e.g.

import {compiler} from 'mdast-util-from-markdown/lib/index'

compiler(options)(events)

I guess this just requires the addition of export function compiler..., and a small modification of package.json, like in micromark itself:

{
  "exports": {
    ".": {
      "development": "./dev/index.js",
      "default": "./index.js"
    },
    "./lib/index": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    },
    "./lib/index.js": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    }
  }
}

Alternatives

Don't think so

Member

What’s the reason you have events?
compile and events are all rather “internal” and not “pretty”

Author

To implement https://github.com/executablebooks/myst-spec, and replace our current markdown-it implementation: https://github.com/executablebooks/markdown-it-docutils, I need to be able to perform nested/incremental parsing:

// mdast nested parsing proof-of-principle
import {parse} from 'micromark/lib/parse'
import {postprocess} from 'micromark/lib/postprocess'
import {preprocess} from 'micromark/lib/preprocess'
import {compiler} from 'mdast-util-from-markdown/lib/index'

// Take the following example. The problem here is that:
// (a) we first want to do a top-level parse of the source file, not processing the directive
// (b) we then want to do a nested parse of the directive content,
//     but within the "context" of the top-level parse.
const content = `
Paragraph

\`\`\`{note}
[x]
\`\`\`

[x]: https://www.google.com
`

// This is where we would load MyST specific plugins and configuration
const options = {}
// This adapted parser allows us to pre-set the parsing context,
// for the starting position of the text (in the source file),
// and any previously parsed definition identifiers (for the definition lookup). 
function parseMarkdown(content, options, initialPosition, defined) {
    const parser = parse(options)
    parser.defined.push(...(defined || []))
    const events = postprocess(
        parser.document(initialPosition).write(preprocess()(content, 'utf8', true))
    )
    return {mdast: compiler(options)(events), defined: parser.defined}
}

// (a) first we perform the top-level parse
const {mdast, defined} = parseMarkdown(content, options)

// we then get the initial AST, and also any identifiers for definitions
console.log(mdast)
console.log(defined)

// ... some extra steps here would identify the directive,
// and give us its content and the content starting position
const nestedContent = `[x]`
const initialPosition = {line: 5, column: 1, offset: 22}

// If we did not provide the definition identifiers here then,
// by the CommonMark spec, the reference would simply be parsed as text. 
const {mdast: mdastNested} = parseMarkdown(nestedContent, options, initialPosition, defined)

Trust me, I know the "unprettiness" of Markdown parsing 😅, I'm also the author of https://github.com/executablebooks/markdown-it-py

Events and compilers are already documented as part of your core parsing architecture: https://github.com/micromark/micromark#architecture, so I would not necessarily say they are completely "internal" 😬

Author

FYI, if we can get all this working, then we are hoping to utilise it as the core parsing architecture in products such as https://curvenote.com/, https://irydium.dev/ and https://github.com/agoose77/jupyterlab-markup 😄

Member

> // Take the following example. The problem here is that:
> // (a) we first want to do a top-level parse of the source file, not processing the directive
> // (b) we then want to do a nested parse of the directive content,
> //     but within the "context" of the top-level parse.

Can you expand on this? Markdown already allows for (a). What is the “context” you mean in (b)?

Author

The context is:

- Initialising the parse with the correct initial position, so that all the node positions point to their correct places in the source file. You could do this retroactively, in a post-processing step, but it's nicer to do in one parse.
- Initialising the parser with known definition/footnote identifiers. This is the key point: because CommonMark only parses references to known definitions (otherwise treating them as plain text), the nested parse needs this context of "found" definitions.

It would be great if CommonMark would just parse all [x] syntax as definition references, irrespective of what definitions are present, then allow the renderer to handle missing definitions, but such is life 😒.
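
For reference, the "retroactive" option for positions could look roughly like the sketch below. It is only an illustration: `shiftPoint` and `shiftPositions` are hypothetical helpers, not part of micromark or mdast-util-from-markdown, and it assumes the nested fragment is a contiguous slice of the source (no indentation stripped from it).

```javascript
// Hypothetical helpers (not part of any library): shift every position in an
// mdast-like tree so a fragment parsed on its own points into the outer file.
// Assumption: the fragment is a contiguous slice of the source text.
function shiftPoint(point, start) {
  return {
    line: point.line + start.line - 1,
    // Only points on the fragment's first line inherit the starting column.
    column: point.line === 1 ? point.column + start.column - 1 : point.column,
    offset: point.offset + start.offset
  }
}

function shiftPositions(node, start) {
  const shifted = {...node}
  if (node.position) {
    shifted.position = {
      start: shiftPoint(node.position.start, start),
      end: shiftPoint(node.position.end, start)
    }
  }
  if (Array.isArray(node.children)) {
    shifted.children = node.children.map((child) => shiftPositions(child, start))
  }
  return shifted
}

// A hand-written fragment tree, as if `[x]\nmore` had been parsed from 1:1.
const fragment = {
  type: 'root',
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 2, column: 5, offset: 8}
  },
  children: []
}

// Arbitrary start point of the fragment inside the outer file.
const shiftedFragment = shiftPositions(fragment, {line: 10, column: 3, offset: 120})
console.log(shiftedFragment.position)
```

This covers the simple case only; it is the extra pass that passing an initial position to the parser would avoid.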

https://github.com/micromark/micromark/blob/fc5e2d8b83eb9c01c9bfd2f4b1ea4e42e6a7e224/packages/micromark-util-types/index.js#L20

Member

Why not integrate with micromark in an extension?
Extensions parse their thing and they can annotate that some stuff inside them should be parsed next

Author

> Why not integrate with micromark in an extension?

Possibly, but it then means that "everything" has to be parsed in a single parse, and makes things a lot less "modular" and incremental

the idea with these directives, is that you perform an initial parse, which just identifies the directives

```{note}
Internal *markdown*
```
```{note}
more
```

which gets you to an intermediate AST

<directive name="note">
    Internal *markdown*
<directive name="note">
    more

Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

<admonition type="note">
  <paragraph>
    <text>
        Internal
    <emphasis>
        <text>
           markdown
<admonition type="note">
  <paragraph>
    <text>
        more

This makes it a lot easier than having to do everything at the micromark "level"
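
A rough sketch of that second pass, working purely on the tree. Assumptions: the first parse yields ordinary mdast `code` nodes whose `lang` is the info string (e.g. `{note}`), `parseNested` is a caller-supplied function that returns an mdast root for a string, and the `admonition`/`kind` node shape is made up for illustration.

```javascript
// Hypothetical second pass (not an existing API): replace fenced code nodes
// whose info string looks like `{name}` with directive-generated nodes.
// `parseNested` is supplied by the caller and is assumed to return a root.
function expandDirectives(node, parseNested) {
  const match = node.type === 'code' ? /^\{([\w-]+)\}$/.exec(node.lang || '') : null
  if (match) {
    return {
      type: 'admonition', // illustrative node type, not part of mdast
      kind: match[1],
      children: parseNested(node.value).children
    }
  }
  if (Array.isArray(node.children)) {
    return {
      ...node,
      children: node.children.map((child) => expandDirectives(child, parseNested))
    }
  }
  return node
}

// Stand-in nested parser so the sketch runs without any dependencies;
// a real one would call the markdown parser again.
const stubParse = (value) => ({
  type: 'root',
  children: [{type: 'paragraph', children: [{type: 'text', value}]}]
})

const intermediate = {
  type: 'root',
  children: [
    {type: 'code', lang: '{note}', meta: null, value: 'Internal *markdown*'},
    {type: 'code', lang: 'js', meta: null, value: 'let x = 1'}
  ]
}

const expanded = expandDirectives(intermediate, stubParse)
console.log(JSON.stringify(expanded, null, 2))
```

Ordinary code blocks (like the `js` one here) pass through untouched; only directive-looking fences trigger a nested parse.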

https://spec.commonmark.org/dingus/?text=%20%20%20%60%60%60%7Bnote%7D%0A%20%20Internal%0A%20*markdown*%0Amore%0A%60%60%60

Member

the thing is that with tracking position (one thing) but importantly all the definition identifier stuff, you’re replicating a lot of the work.

Also note that the positional info is not going to be 100% correct if you have mdast for fenced code, and then parse its result, because a funky “indent”/exdent is allowed:

> This makes it a lot easier than having to do everything at the micromark "level"

Uhhh, this post is about juggling micromark internals to not have to make a micromark extension? How is that easier? 🤔 I don’t get it.

It sounds simpler to:

- copy/paste https://github.com/micromark/micromark/blob/main/packages/micromark-core-commonmark/dev/lib/code-fenced.js
- look at https://github.com/micromark/micromark-extension-directive/blob/7e2f9384e6ccb35e1182e2471851918be37c3fd4/dev/lib/directive-container.js#L149-L152 to see how it handles internal containers

> Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

micromark already does that? It has it built in. Why do you need separate stages?

Member

How are you using “incremental”?

Author

> micromark already does that? It has it built in. Why do you need separate stages?

Hmmm, I feel I'm not explaining directives properly to you; processing directive content is not just about parsing, it's about node generation. Directives need to be able to generate MDAST nodes, and these nodes do not necessarily relate directly to syntax in the source text.

Take the figure directive:

This:

```{figure} https://via.placeholder.com/150
This is the figure caption!

Something! A legend!?
```

needs to go to this:

    title: Simple figure
    id: container
    mdast:
      type: root
      children:
        - type: directive
          kind: figure
          args: https://via.placeholder.com/150
          value: |-
            This is the figure caption!
            Something! A legend!?
          children:
            - type: container
              kind: figure
              children:
                - type: image
                  url: https://via.placeholder.com/150
                - type: caption
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: This is the figure caption!
                - type: legend
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: Something! A legend!?

How would you even go about getting a micromark extension to achieve this?

It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives.
But you do need to have a way to perform nested parsing.

This is exactly how docutils/sphinx directives work; you are generating nodes, and only performing nested parsing when necessary: https://github.com/live-clones/docutils/blob/6548b56d9ea9a3e101cd62cfcd727b6e9e8b7ab6/docutils/docutils/parsers/rst/directives/images.py#L146

Author

FYI, I also know of https://github.com/micromark/micromark-extension-directive, but these directives are quite different, in that their content is "interpreted" text, i.e. it might not be Markdown.

Take for example csv-table: https://docutils.sourceforge.io/docs/ref/rst/directives.html#csv-table-1

```{csv-table}
:header: "Treat", "Quantity", "Description"
:widths: 15, 10, 30

"Albatross", 2.99, "On a stick!"
"Crunchy Frog", 1.49, "If we took the bones out, it wouldn't be crunchy, now would it?"
"Gannet Ripple", 1.99, "On a stick!"
```

Here, the content will be converted into table nodes, which is not something that can be done in a micromark extension.
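
To make the "node generation" point concrete, here is a minimal sketch of what such a directive could do with its raw content. Everything here is hypothetical: `parseCsvLine` and `csvTableToMdast` are illustrative names, the CSV handling is deliberately naive (it trims whitespace around fields and only understands double-quoted fields), `:widths:` is read but ignored, and the output uses the `table`/`tableRow`/`tableCell` node types from mdast's GFM tables.

```javascript
// Naive CSV line splitter: commas separate fields, double quotes protect
// commas, and `""` inside quotes is a literal quote. Illustrative only.
function parseCsvLine(line) {
  const fields = []
  let field = ''
  let quoted = false
  for (let i = 0; i < line.length; i++) {
    const char = line[i]
    if (quoted) {
      if (char === '"' && line[i + 1] === '"') {
        field += '"'
        i++
      } else if (char === '"') {
        quoted = false
      } else {
        field += char
      }
    } else if (char === '"') {
      quoted = true
    } else if (char === ',') {
      fields.push(field.trim())
      field = ''
    } else {
      field += char
    }
  }
  fields.push(field.trim())
  return fields
}

// Hypothetical directive body -> mdast table. Leading `:name: value` lines
// are treated as options; the remaining non-blank lines are CSV rows.
function csvTableToMdast(value) {
  const lines = value.split('\n')
  const options = {}
  let index = 0
  for (; index < lines.length; index++) {
    const match = /^:([\w-]+):\s*(.*)$/.exec(lines[index])
    if (!match) break
    options[match[1]] = match[2]
  }
  const rows = lines
    .slice(index)
    .filter((line) => line.trim() !== '')
    .map(parseCsvLine)
  if (options.header) rows.unshift(parseCsvLine(options.header))
  return {
    type: 'table',
    children: rows.map((cells) => ({
      type: 'tableRow',
      children: cells.map((cell) => ({
        type: 'tableCell',
        children: [{type: 'text', value: cell}]
      }))
    }))
  }
}

const csvTable = csvTableToMdast(
  [
    ':header: "Treat", "Quantity", "Description"',
    ':widths: 15, 10, 30',
    '',
    '"Albatross", 2.99, "On a stick!"',
    '"Crunchy Frog", 1.49, "If we took the bones out, it wouldn\'t be crunchy, now would it?"',
    '"Gannet Ripple", 1.99, "On a stick!"'
  ].join('\n')
)
console.log(csvTable.children.length)
```

None of the resulting nodes correspond to markdown syntax in the source, which is why this step sits at the tree level.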

Member

Thanks for expanding. I now understand the use case better, particularly why it’s a choice at the AST level, after the initial parse, to parse subdocuments.

I do find your earlier statements about wanting to reuse identifiers of “outer” definitions in these “inner” a bit weird. If they are really so separate and optional, it seems beneficial to have them “sandboxed” from the outer content, and in other words it seems to be at odds with your goal to reuse identifiers.

> How would you even go about getting a micromark extension to achieve this?

I don’t see why not? micromark can parse that syntax. Though micromark is a level under mdast. So micromark would parse the syntax. A utility would turn the events into that tree.

> It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives.
> But you do need to have a way to perform nested parsing.

I am not suggesting to do the “Processing directives” part in micromark. As I understand it we both believe that that can happen in mdast.
I am suggesting to “perform nested parsing” in micromark. Because markdown does “nested” already: micromark has this builtin.

Have you seen https://github.com/micromark/micromark#extending-markdown?
Where are you, on a scale from X to Y, between “we have a ton of content in the wild using this so we can’t change” and “we can still come up with new and improved ways”?

This issue is about compile, but you also mentioned:

- Add options.startPoint support to micromark (and: how to even handle indents?)
- How to pass “existing” identifiers? (and: how even to do that for extensions (footnotes))

How important are these to you? Are there other subissues you perceive?

19 remaining items

closed this as completed

Author

Also feel free to keep on discussing here!

Yeh no worries

yamachu mentioned this on Sep 7, 2022: remark-parse/lib/index.d.ts imports a module that cannot be referenced (remarkjs/remark#1039)

would you be open to adding options.from so that it can be passed to document?

from can be passed to createTokenizer when working with micromark, but because the compiler function is not exported, i cannot make any use of the option without reimplementing the compiler myself.

export function fromMarkdown(value, encoding, options) {
  if (typeof encoding !== 'string') {
    options = encoding
    encoding = undefined
  }

  return compiler(options)(
    postprocess(
      parse(options)
        // .document()
        .document(options.from)
        .write(preprocess()(value, encoding, true))
    )
  )
}

Member

Hi Lex! Uhm, maybe, maybe not? Sounds like you want to increment positional info. I could see that not work the way you want. Can you elaborate more on your use case?

The reason I think it will not work, is that there are probably multiple gaps.

/**
 * Some *markdown
 * more* markdown.
 */

There’s a gap before more too. A similar problem occurs in MDX, where the embedded JS expressions can have markdown prefixes:

> <Math value={1 +
> 2} />

A better solution might be around https://github.com/vfile/vfile-location, similar to vfile/vfile-location#14, and the “stops” in mdxjs-rs: https://github.com/wooorm/markdown-rs/blob/60db8e5896be05d23137a6bdb806e63519171f9e/src/util/mdx_collect.rs#L24.

i'm not sure i understand your example 😅

i'm working on an ast for docblocks that supports markdown in comments, so mdast node positions need to be relative to my comment nodes.

i ended up using transforms to apply my positioning logic, but feel it to be quite messy. based on some soft "tests", options.from would be more ideal

Member

There are several gaps. from only gives info for the start of the first line. There are multiple lines. If you want what you want, you’d need multiple froms. That doesn’t exist. I don’t think this does what you want.

from is this place:

/**
 * |Some *markdown
 * more* markdown.
 */

Here your positional info is out of date again:

/**
 * Some *markdown
| * more* markdown.
 */

I recommend taking more time with my previous comment. Trying to grasp what it says. I think it describes the problem well, for your case, but also for MDX, and then shows how it is solved for MDX, which is what I believe you need to do too.
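
The gap problem described above can be sketched roughly as follows. This is a loose illustration of the "stops" idea, not the actual mdx_collect or vfile-location code: `collectDocblock` and `toSourceOffset` are hypothetical names, and the prefix handling assumes simple ` * ` docblock lines.

```javascript
// Hypothetical sketch: strip the ` * ` prefix from each docblock line and
// record "stops", pairs of [offset in stripped markdown, offset in source].
function collectDocblock(source) {
  let markdown = ''
  const stops = []
  let sourceOffset = 0
  for (const line of source.split('\n')) {
    // A leading `*` that is not the closing `*/`, plus one optional space.
    const match = /^\s*\*(?!\/) ?/.exec(line)
    if (match) {
      stops.push([markdown.length, sourceOffset + match[0].length])
      markdown += line.slice(match[0].length) + '\n'
    }
    sourceOffset += line.length + 1 // +1 for the line ending
  }
  return {markdown, stops}
}

// Map an offset in the stripped markdown back to an offset in the source,
// using the last stop at or before that offset.
function toSourceOffset(stops, offset) {
  let stop = stops[0]
  for (const candidate of stops) {
    if (candidate[0] > offset) break
    stop = candidate
  }
  return stop[1] + (offset - stop[0])
}

const docblock = '/**\n * Some *markdown\n * more* markdown.\n */'
const collected = collectDocblock(docblock)
console.log(collected.stops)

// Offset of `more` in the stripped markdown, mapped back into the docblock.
const moreInMarkdown = collected.markdown.indexOf('more')
console.log(toSourceOffset(collected.stops, moreInMarkdown))
```

A single start point corresponds to having only the first stop; every later line needs its own stop, which is the "multiple froms" point.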

oh i see, but i actually do want the initial from so the root node doesn't start at 1:1. i already have the logic to account for comment delimiters, if that's what you meant by gaps/multiple froms.

Member

My point is that you want that and more. Having just that is not enough for you.

Member

Please try to patch-package this issue, or edit it in your node_modules locally, and check if that works for you? I don't think it will.

i think that is where our disconnect is. i know options.from isn't enough by itself, but it would be useful for markdown "chunks" spanning one line (i.e. a one line description) because no shifting is needed. for chunks spanning more than one line, options.from is useful so i can start my calculations from the given start point instead of 1:1.

i came to this conclusion because my soft "tests" included editing node_modules locally, lol.

Member

It could theoretically be useful for a hypothetical human. I’m not interested in adding things that might be useful to someone in the future, as I find that often, that future user practically wants something else.

Meanwhile, I believe you are helped with vfile/vfile-location#14 and stops from mdx_collect.

is that your suggested approach for pure markdown snippets as well?

additionally, from what i see, that issue is about max line length, which isn't what i'm looking for.