
Export compiler function #29

@chrisjsewell

Initial checklist

  • I read the support docs
  • I read the contributing guide
  • I agree to follow the code of conduct
  • I searched issues and couldn’t find anything (or linked relevant results below)

Problem

Heya, I would like to directly import/use the compiler, on a pre-created list of events.
I think this is currently not possible?

Obviously this is the key function provided by this package; fromMarkdown is then just a wrapper around it and the upstream preprocess/parse/postprocess functions (all importable).

Solution

Allow for e.g.

import {compiler} from 'mdast-util-from-markdown/lib/index'

compiler(options)(events)

I guess this just requires the addition of export function compiler..., and a small modification of package.json, like in micromark itself:

{
  "exports": {
    ".": {
      "development": "./dev/index.js",
      "default": "./index.js"
    },
    "./lib/index": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    },
    "./lib/index.js": {
      "development": "./dev/lib/index.js",
      "default": "./lib/index.js"
    }
  }
}

Alternatives

Don't think so

wooorm

wooorm commented on Mar 25, 2022

@wooorm
Member

What’s the reason you have events?
compile and events are all rather “internal” and not “pretty”

chrisjsewell

chrisjsewell commented on Mar 25, 2022

@chrisjsewell
Author

To implement https://github.com/executablebooks/myst-spec, and replace our current markdown-it implementation: https://github.com/executablebooks/markdown-it-docutils, I need to be able to perform nested/incremental parsing:

// mdast nested parsing proof-of-principle
import {parse} from 'micromark/lib/parse'
import {postprocess} from 'micromark/lib/postprocess'
import {preprocess} from 'micromark/lib/preprocess'
import {compiler} from 'mdast-util-from-markdown/lib/index'

// Take the following example. The problem here is that:
// (a) we first want to do a top-level parse of the source file, not processing the directive
// (b) we then want to do a nested parse of the directive content,
//     but within the "context" of the top-level parse.
const content = `
Paragraph

\`\`\`{note}
[x]
\`\`\`

[x]: https://www.google.com
`

// This is where we would load MyST specific plugins and configuration
const options = {}
// This adapted parser allows us to pre-set the parsing context,
// for the starting position of the text (in the source file),
// and any previously parsed definition identifiers (for the definition lookup). 
function parseMarkdown(content, options, initialPosition, defined) {
    const parser = parse(options)
    parser.defined.push(...(defined || []))
    const events = postprocess(
        parser.document(initialPosition).write(preprocess()(content, 'utf8', true))
    )
    return {mdast: compiler(options)(events), defined: parser.defined}
}

// (a) first we perform the top-level parse
const {mdast, defined} = parseMarkdown(content, options)

// we then get the initial AST, and also any identifiers for definitions
console.log(mdast)
console.log(defined)

// ... some extra steps here would identify the directive,
// and give us its content and the content starting position
const nestedContent = `[x]`
// In `content` above, the directive body `[x]` starts on line 5 at offset 22
const initialPosition = {line: 5, column: 1, offset: 22}

// If we did not provide the definition identifiers here then,
// by the CommonMark spec, the reference would simply be parsed as text. 
const {mdast: mdastNested} = parseMarkdown(nestedContent, options, initialPosition, defined)

Trust me, I know the "unprettiness" of Markdown parsing 😅, I'm also the author of https://github.com/executablebooks/markdown-it-py

Events and compilers are already documented as part of your core parsing architecture: https://github.com/micromark/micromark#architecture, so I would not necessarily say they are completely "internal" 😬

chrisjsewell

chrisjsewell commented on Mar 25, 2022

@chrisjsewell
Author

FYI, if we can get all this working, then we are hoping to utilise it as the core parsing architecture in products such as https://curvenote.com/, https://irydium.dev/ and https://github.com/agoose77/jupyterlab-markup 😄

wooorm

wooorm commented on Mar 25, 2022

@wooorm
Member

> // Take the following example. The problem here is that:
> // (a) we first want to do a top-level parse of the source file, not processing the directive
> // (b) we then want to do a nested parse of the directive content,
> //     but within the "context" of the top-level parse.

Can you expand on this? Markdown already allows for (a). What is the “context” you mean in (b)?

chrisjsewell

chrisjsewell commented on Mar 25, 2022

@chrisjsewell
Author

The context is:

  1. Initialising the parse with the correct initial position, so that all the node positions point to their correct places in the source file. You could do this retroactively, in a post-processing step, but it's nicer to do in one parse

  2. Initialising the parser with known definition/footnote identifiers. This is the key point really: because CommonMark only parses definition references of known definitions (otherwise treating them as plain text), you have to have this context of "found" definitions.
    It would be great if CommonMark would just parse all [x] syntax as definition references, irrespective of what definitions are present, and then allow the renderer to handle missing definitions, but such is life 😒.
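
For reference, the retroactive alternative mentioned in point 1 could look roughly like the sketch below. It assumes unist-style `position` objects (1-indexed `line` and `column`, 0-indexed `offset`). `shiftPoint` and `shiftPositions` are hypothetical helpers, not part of micromark or mdast-util-from-markdown, and the start point is made up for illustration. The result is only correct when the nested content is one contiguous slice of the source file.

```javascript
// Hypothetical helper: move a point that was computed relative to 1:1 so that
// it is relative to `start`, a point in the outer source file.
function shiftPoint(point, start) {
  return {
    line: point.line + start.line - 1,
    // Only points on the first nested line inherit the start column.
    column: point.line === 1 ? point.column + start.column - 1 : point.column,
    offset: point.offset + start.offset
  }
}

// Walk the tree and shift every `position` in place.
function shiftPositions(node, start) {
  if (node.position) {
    node.position = {
      start: shiftPoint(node.position.start, start),
      end: shiftPoint(node.position.end, start)
    }
  }
  for (const child of node.children || []) {
    shiftPositions(child, start)
  }
  return node
}

// Hand-written tree for the nested content `[x]`, as if parsed from 1:1.
const nestedTree = {
  type: 'root',
  position: {
    start: {line: 1, column: 1, offset: 0},
    end: {line: 1, column: 4, offset: 3}
  },
  children: [
    {
      type: 'text',
      value: '[x]',
      position: {
        start: {line: 1, column: 1, offset: 0},
        end: {line: 1, column: 4, offset: 3}
      }
    }
  ]
}

// Made-up start point of the nested content in the outer file.
const shiftedTree = shiftPositions(nestedTree, {line: 10, column: 3, offset: 120})
console.log(shiftedTree.children[0].position)
```

This covers positions only. It does nothing for the definition identifiers in point 2, which is why the start point alone is not the main concern here.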

wooorm

wooorm commented on Mar 25, 2022

@wooorm
Member

Why not integrate with micromark in an extension?
Extensions parse their thing and they can annotate that some stuff inside them should be parsed next

https://github.com/micromark/micromark/blob/fc5e2d8b83eb9c01c9bfd2f4b1ea4e42e6a7e224/packages/micromark-util-types/index.js#L20

chrisjsewell

chrisjsewell commented on Mar 25, 2022

@chrisjsewell
Author

> Why not integrate with micromark in an extension?

Possibly, but it then means that "everything" has to be parsed in a single parse, and makes things a lot less "modular" and incremental

The idea with these directives is that you perform an initial parse, which just identifies the directives:

```{note}
Internal *markdown*
```
```{note}
more
```

which gets you to an intermediate AST

<directive name="note">
    Internal *markdown*
<directive name="note">
    more

Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

<admonition type="note">
  <paragraph>
    <text>
        Internal
    <emphasis>
        <text>
           markdown
<admonition type="note">
  <paragraph>
    <text>
        more

This makes it a lot easier than having to do everything at the micromark "level"
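
A minimal sketch of the second stage described above, assuming the first parse left each directive as an ordinary mdast `code` node whose `lang` is the info string (for example `{note}`). The `directive` node shape and the `parseNested` callback are assumptions made for illustration; `wrapInParagraph` is a stand-in for a real nested Markdown parse.

```javascript
// Matches an info string such as `{note}` and captures the directive name.
const directiveLang = /^\{([\w-]+)\}$/

// Replace every directive-like `code` node with a `directive` node whose
// children come from a nested parse of the code content.
function expandDirectives(node, parseNested) {
  if (!node.children) return node
  node.children = node.children.map((child) => {
    const match = child.type === 'code' && directiveLang.exec(child.lang || '')
    if (!match) return expandDirectives(child, parseNested)
    return {
      type: 'directive',
      name: match[1],
      value: child.value,
      children: parseNested(child.value).children
    }
  })
  return node
}

// Stand-in for a real nested parse: wraps the raw text in one paragraph.
const wrapInParagraph = (value) => ({
  type: 'root',
  children: [{type: 'paragraph', children: [{type: 'text', value}]}]
})

const intermediateTree = {
  type: 'root',
  children: [
    {type: 'code', lang: '{note}', value: 'Internal *markdown*'},
    {type: 'code', lang: 'js', value: 'more'}
  ]
}

const expandedTree = expandDirectives(intermediateTree, wrapInParagraph)
console.log(JSON.stringify(expandedTree, null, 2))
```

Ordinary code blocks (here the `js` one) are left untouched; only info strings wrapped in braces are treated as directives.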

wooorm

wooorm commented on Mar 25, 2022

@wooorm
Member

the thing is that with tracking position (one thing) but importantly all the definition identifier stuff, you’re replicating a lot of the work.

Also note that the positional info is not going to be 100% if you have mdast for fenced code, and then parse its result, because a funky “indent”/exdent is allowed:

https://spec.commonmark.org/dingus/?text=%20%20%20%60%60%60%7Bnote%7D%0A%20%20Internal%0A%20*markdown*%0Amore%0A%60%60%60

> This makes it a lot easier than having to do everything at the micromark "level"

Uhhh, this post is about juggling micromark internals to not have to make a micromark extension? How is that easier? 🤔 I don’t get it.

It sounds simpler to

> Then you perform a subsequent parse, which processes the directives and gets you to your final AST:

micromark already does that? It has it built in. Why do you need separate stages?

wooorm

wooorm commented on Mar 25, 2022

@wooorm
Member

How are you using “incremental”?

chrisjsewell

chrisjsewell commented on Mar 25, 2022

@chrisjsewell
Author

> micromark already does that? It has it built in. Why do you need separate stages?

Hmmm, I feel I'm not explaining directives properly to you; processing directive content is not just about parsing, it's about node generation. Directives need to be able to generate MDAST nodes, and these nodes do not necessarily relate directly to syntax in the source text.

Take the figure directive:

This:

```{figure} https://via.placeholder.com/150
This is the figure caption!

Something! A legend!?
```

needs to go to this:

    title: Simple figure
    id: container
    mdast:
      type: root
      children:
        - type: directive
          kind: figure
          args: https://via.placeholder.com/150
          value: |-
            This is the figure caption!
            Something! A legend!?
          children:
            - type: container
              kind: figure
              children:
                - type: image
                  url: https://via.placeholder.com/150
                - type: caption
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: This is the figure caption!
                - type: legend
                  children:
                    - type: paragraph
                      children:
                        - type: text
                          value: Something! A legend!?

How would you even go about getting a micromark extension to achieve this?

It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives.
But you do need to have a way to perform nested parsing.
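
To make the node-generation step concrete, here is a rough sketch of a handler that turns the figure directive above into the `container`/`image`/`caption`/`legend` structure. The function name `figureDirective` and the `parseNested` callback are hypothetical; `paragraphsOnly` is a deliberately naive stand-in for a real nested parse.

```javascript
// Hypothetical handler: builds generated nodes from a `figure` directive.
// `parseNested` must return a root node whose children are block nodes.
function figureDirective(directive, parseNested) {
  const [first, ...rest] = parseNested(directive.value).children
  // The image is generated from the directive argument, not from Markdown syntax.
  const children = [{type: 'image', url: directive.args}]
  // The first block becomes the caption, any remaining blocks the legend.
  if (first) children.push({type: 'caption', children: [first]})
  if (rest.length > 0) children.push({type: 'legend', children: rest})
  return {
    ...directive,
    children: [{type: 'container', kind: 'figure', children}]
  }
}

// Stand-in parser: one paragraph per blank-line-separated chunk.
function paragraphsOnly(value) {
  return {
    type: 'root',
    children: value
      .split(/\n{2,}/)
      .filter((chunk) => chunk.trim() !== '')
      .map((chunk) => ({
        type: 'paragraph',
        children: [{type: 'text', value: chunk}]
      }))
  }
}

const figureNode = figureDirective(
  {
    type: 'directive',
    kind: 'figure',
    args: 'https://via.placeholder.com/150',
    value: 'This is the figure caption!\n\nSomething! A legend!?'
  },
  paragraphsOnly
)
console.log(JSON.stringify(figureNode, null, 2))
```

Only the caption and legend need a nested parse; the `image` and `container` nodes have no corresponding Markdown syntax in the source.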

This is exactly how docutils/sphinx directives work; you are generating nodes, and only performing nested parsing when necessary: https://github.com/live-clones/docutils/blob/6548b56d9ea9a3e101cd62cfcd727b6e9e8b7ab6/docutils/docutils/parsers/rst/directives/images.py#L146

chrisjsewell

chrisjsewell commented on Mar 26, 2022

@chrisjsewell
Author

FYI, I also know of https://github.com/micromark/micromark-extension-directive, but these directives are quite different, in that their content is "interpreted" text, i.e. it might not be Markdown.

Take for example csv-table: https://docutils.sourceforge.io/docs/ref/rst/directives.html#csv-table-1

```{csv-table}
:header: "Treat", "Quantity", "Description"
:widths: 15, 10, 30

"Albatross", 2.99, "On a stick!"
"Crunchy Frog", 1.49, "If we took the bones out, it wouldn't be crunchy, now would it?"
"Gannet Ripple", 1.99, "On a stick!"
```

Here, the content will be converted into table nodes, which is not something that can be done in a micromark extension.
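
As a rough illustration of such "interpreted" content, the sketch below converts the directive body into mdast `table`, `tableRow`, and `tableCell` nodes. The helper names are hypothetical, and the CSV handling is intentionally minimal (no escaped quotes, no multi-line fields); the `:widths:` option is read but not used.

```javascript
// Split one CSV line into trimmed fields, keeping commas inside double quotes.
function splitCsvLine(line) {
  const fields = []
  let field = ''
  let quoted = false
  for (const char of line) {
    if (char === '"') {
      quoted = !quoted
    } else if (char === ',' && !quoted) {
      fields.push(field.trim())
      field = ''
    } else {
      field += char
    }
  }
  fields.push(field.trim())
  return fields
}

// Hypothetical handler: leading `:name: value` lines are options,
// every other non-empty line is a table row.
function csvTableDirective(value) {
  const options = {}
  const rows = []
  for (const line of value.split('\n')) {
    const option = /^:([\w-]+):\s*(.*)$/.exec(line)
    if (option && rows.length === 0) {
      options[option[1]] = option[2]
    } else if (line.trim() !== '') {
      rows.push(splitCsvLine(line))
    }
  }
  if (options.header) rows.unshift(splitCsvLine(options.header))
  const cell = (text) => ({
    type: 'tableCell',
    children: [{type: 'text', value: text}]
  })
  return {
    type: 'table',
    children: rows.map((row) => ({type: 'tableRow', children: row.map(cell)}))
  }
}

const tableNode = csvTableDirective(
  [
    ':header: "Treat", "Quantity", "Description"',
    ':widths: 15, 10, 30',
    '',
    '"Albatross", 2.99, "On a stick!"',
    '"Crunchy Frog", 1.49, "If we took the bones out, it wouldn\'t be crunchy, now would it?"',
    '"Gannet Ripple", 1.99, "On a stick!"'
  ].join('\n')
)
console.log(JSON.stringify(tableNode, null, 2))
```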

wooorm

wooorm commented on Mar 27, 2022

@wooorm
Member

Thanks for expanding. I now understand the use case better, particularly why it’s a choice at the AST level, after the initial parse, to parse subdocuments.

I do find your earlier statements about wanting to reuse identifiers of “outer” definitions in these “inner” documents a bit weird. If they are really so separate and optional, it seems beneficial to have them “sandboxed” from the outer content, and in other words it seems to be at odds with your goal to reuse identifiers.

> How would you even go about getting a micromark extension to achieve this?

I don’t see why not? micromark can parse that syntax. Though micromark is a level under mdast. So micromark would parse the syntax. A utility would turn the events into that tree.

> It is a lot easier to work at the MDAST node level than the micromark event level, when processing directives.
> But you do need to have a way to perform nested parsing.

I am not suggesting to do the “Processing directives” part in micromark. As I understand it we both believe that that can happen in mdast.
I am suggesting to “perform nested parsing” in micromark. Because markdown does “nested” already: micromark has this builtin.



This issue is about compile, but you also mentioned:

  • Add options.startPoint support to micromark (and: how to even handle indents?)
  • How to pass “existing” identifiers? (and: how even to do that for extensions (footnotes))

How important are these to you? Are there other subissues you perceive?

19 remaining items

chrisjsewell

chrisjsewell commented on Mar 29, 2022

@chrisjsewell
Author

> Also feel free to keep on discussing here!

Yeh no worries

unicornware

unicornware commented on Apr 11, 2024

@unicornware

@wooorm

would you be open to adding options.from so that it can be passed to document?

from can be passed to createTokenizer when working with micromark, but because the compiler function is not exported, i cannot make any use of the option without reimplementing the compiler myself.

export function fromMarkdown(value, encoding, options) {
  if (typeof encoding !== 'string') {
    options = encoding
    encoding = undefined
  }

  return compiler(options)(
    postprocess(
      parse(options)
        // Was `.document()`; `options` may be undefined, so guard the lookup.
        .document(options ? options.from : undefined)
        .write(preprocess()(value, encoding, true))
    )
  )
}
wooorm

wooorm commented on Apr 13, 2024

@wooorm
Member

Hi Lex! Uhm, maybe, maybe not? Sounds like you want to increment positional info. I could see that not work the way you want. Can you elaborate more on your use case?

The reason I think it will not work, is that there are probably multiple gaps.

/**
 * Some *markdown
 * more* markdown.
 */

There’s a gap before more too. A similar problem occurs in MDX, where the embedded JS expressions can have markdown prefixes:

> <Math value={1 +
> 2} />

A better solution might be around https://github.com/vfile/vfile-location, similar to vfile/vfile-location#14, and the “stops” in mdxjs-rs: https://github.com/wooorm/markdown-rs/blob/60db8e5896be05d23137a6bdb806e63519171f9e/src/util/mdx_collect.rs#L24.

unicornware

unicornware commented on Jul 5, 2024

@unicornware

@wooorm

i'm not sure i understand your example 😅

i'm working on an ast for docblocks that supports markdown in comments, so mdast node positions need to be relative to my comment nodes.

i ended up using transforms to apply my positioning logic, but feel it to be quite messy. based on some soft "tests", options.from would be more ideal

wooorm

wooorm commented on Jul 5, 2024

@wooorm
Member

There are several gaps. from only gives info for the start of the first line. There are multiple lines. If you want what you want, you’d need multiple froms. That doesn’t exist. I don’t think this does what you want.

from is this place:

/**
 * |Some *markdown
 * more* markdown.
 */

Here your positional info is out of date again:

/**
 * Some *markdown
| * more* markdown.
 */

I recommend taking more time with my previous comment. Trying to grasp what it says. I think it describes the problem well, for your case, but also for MDX, and then shows how it is solved for MDX, which is what I believe you need to do too.
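
The gaps described above can be handled with a per-line offset map. The sketch below is loosely inspired by the “stops” idea and is not the actual implementation in vfile-location or mdx_collect; `collectStops` and `toSourceOffset` are hypothetical names. It strips the ` * ` prefix from each comment line, records where each stripped line came from, and maps an offset in the stripped Markdown back to the original source.

```javascript
// Strip the comment prefix from every line. Each recorded stop is a pair:
// [offset in the stripped text, matching offset in the original source].
function collectStops(source) {
  const stops = []
  let stripped = ''
  let sourceOffset = 0
  for (const line of source.split('\n')) {
    const prefix = /^\s*\*\s?/.exec(line)
    const skip = prefix ? prefix[0].length : 0
    stops.push([stripped.length, sourceOffset + skip])
    stripped += line.slice(skip) + '\n'
    sourceOffset += line.length + 1
  }
  return {stripped, stops}
}

// Map an offset in the stripped text back to the source by using the last
// stop at or before that offset.
function toSourceOffset(stops, offset) {
  let found = stops[0]
  for (const stop of stops) {
    if (stop[0] <= offset) found = stop
  }
  return found[1] + (offset - found[0])
}

const commentBody = ' * Some *markdown\n * more* markdown.'
const {stripped, stops} = collectStops(commentBody)

// The emphasis spans both lines, so its two markers need different stops.
const openInStripped = stripped.indexOf('*')
const closeInStripped = stripped.lastIndexOf('*')
console.log(toSourceOffset(stops, openInStripped), toSourceOffset(stops, closeInStripped))
```

A single starting point would place the opening marker correctly but not the closing one, because the second line has its own stripped prefix.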

unicornware

unicornware commented on Jul 5, 2024

@unicornware

@wooorm

oh i see, but i actually do want the initial from so the root node doesn't start at 1:1. i already have the logic to account for comment delimiters, if that's what you meant by gaps/multiple froms.

wooorm

wooorm commented on Jul 5, 2024

@wooorm
Member

My point is that you want that and more. Having just that is not enough for you.

wooorm

wooorm commented on Jul 5, 2024

@wooorm
Member

Please try to patch-package this issue, or edit it in your node_modules locally, and check if that works for you? I don't think it will.

unicornware

unicornware commented on Jul 7, 2024

@unicornware

@wooorm

i think that is where our disconnect is. i know options.from isn't enough by itself, but it would be useful for markdown "chunks" spanning one line (i.e. a one line description) because no shifting is needed. for chunks spanning more than one line, options.from is useful so i can start my calculations from the given start point instead of 1:1.

i came to this conclusion because my soft "tests" included editing node_modules locally, lol.

wooorm

wooorm commented on Jul 8, 2024

@wooorm
Member

It could theoretically be useful for a hypothetical human. I’m not interested in adding things that might be useful to someone in the future, as I find that often, that future user practically wants something else.

Meanwhile, I believe you are helped with vfile/vfile-location#14 and stops from mdx_collect.

unicornware

unicornware commented on Jul 9, 2024

@unicornware

@wooorm

is that your suggested approach for pure markdown snippets as well?

additionally, from what i see, that issue is about max line length, which isn't what i'm looking for.

wooorm

wooorm commented on Jul 9, 2024

@wooorm
Member

That depends, is this use case a problem you have? From what you said before, I grasp that you don’t have that problem or need that solution.

That issue is a feature request for a feature. It was brought up for a particular lint rule. That lint rule deals with line length. There are other lint rules. There is also your case, which is helped by that issue. Please though, read not just the link, but also the rest of what I mentioned:

> A better solution might be around vfile/vfile-location, similar to vfile/vfile-location#14, and the “stops” in mdxjs-rs: wooorm/markdown-rs@60db8e5/src/util/mdx_collect.rs#L24.


Labels: 👎 phase/no (Post cannot or will not be acted on), 🙅 no/wontfix (This is not (enough of) an issue for this project)
          Export compiler function · Issue #29 · syntax-tree/mdast-util-from-markdown