Skip to content

YAML scanner improvements #84

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nathany opened this issue Oct 28, 2012 · 3 comments
Open

YAML scanner improvements #84

nathany opened this issue Oct 28, 2012 · 3 comments
Milestone

Comments

@nathany
Copy link
Contributor

nathany commented Oct 28, 2012

YAML scanner has problems with unusual mapping keys

Trans reported:

Found an issue with the YAML syntax highlighting in CodeRay. Try it on the following example.
When it hits mri <1.9 it starts reverse highlighting almost every other line.

  - package: system_timer
    engine : mri <1.9

The problem is that the scanner doesn't allow white space before the colon. That's an easy fix; however, the set of characters allowed in keys is limited and non spec-conform.

See http://yaml.org/spec/1.2/spec.html#id2788859 for further reading.

A patch for the 1.0 trunk:

Index: /Users/murphy/ruby/coderay/lib/coderay/scanners/yaml.rb
===================================================================
--- /Users/murphy/ruby/coderay/lib/coderay/scanners/yaml.rb (revision 580)
+++ /Users/murphy/ruby/coderay/lib/coderay/scanners/yaml.rb (working copy)
@@ -75,20 +75,18 @@
           when match = scan(/[,{}\[\]]/)
             encoder.text_token match, :operator
             next
-          when state == :initial && match = scan(/[\w.() ]*\S(?=: |:$)/)
+          when state == :initial && match = scan(/[\w.() ]*\S(?= *:(?: |$))/)
             encoder.text_token match, :key
             key_indent = column(pos - match.size - 1)
-            # encoder.text_token key_indent.inspect, :debug
             state = :colon
             next
-          when match = scan(/(?:"[^"\n]*"|'[^'\n]*')(?=: |:$)/)
+          when match = scan(/(?:"[^"\n]*"|'[^'\n]*')(?= *:(?: |$))/)
             encoder.begin_group :key
             encoder.text_token match[0,1], :delimiter
             encoder.text_token match[1..-2], :content
             encoder.text_token match[-1,1], :delimiter
             encoder.end_group :key
             key_indent = column(pos - match.size - 1)
-            # encoder.text_token key_indent.inspect, :debug
             state = :colon
             next
           when match = scan(/(![\w\/]+)(:([\w:]+))?/)

From Redmine: http://odd-eyed-code.org/issues/231


YAML scanner doesn't recognize false and true

It should also recognize -.Inf and such. Some readable part of the spec is about the Core Schema which Ruby seems to use.

Needs more investigation.

Index: lib/coderay/scanners/yaml.rb
===================================================================
--- lib/coderay/scanners/yaml.rb    (revision 742)
+++ lib/coderay/scanners/yaml.rb    (working copy)
@@ -11,6 +11,11 @@

     KINDS_NOT_LOC = :all

+    CONSTANTS = %w[ true True TRUE false False FALSE null Null NULL ]  # :nodoc:
+    
+    IDENT_KIND = WordList.new(nil).
+      add(CONSTANTS, :pre_constant)
+    
   protected

     def scan_tokens encoder, options
@@ -59,7 +64,11 @@
             encoder.text_token matched, :content if scan(/(?:\n+ {#{string_indent + 1}}.*)+/)
             encoder.end_group :string
             next
-          when match = scan(/(?![!"*&]).+?(?=$|\s+#)/)
+          when match = scan(/(?![!*&\[\],{}]|- ).+?(?=$|\s+#)/)
+            if kind = IDENT_KIND[match]
+              encoder.text_token match, kind
+              next
+            end
             encoder.begin_group :string
             encoder.text_token match, :content
             string_indent = key_indent || column(pos - match.size - 1)
@@ -116,6 +125,9 @@
           when match = scan(/:\w+/)
             encoder.text_token match, :symbol
             next
+          when match = scan(/~|\.(?:inf|Inf|INF|nan|NaN|NAN)/)
+            encoder.text_token match, :pre_constant
+            next
           when match = scan(/[^:\s]+(:(?! |$)[^:\s]*)* .*/)
             encoder.text_token match, :error
             next

From Redmine: http://odd-eyed-code.org/issues/234


YAML scanner doesn't recognize - -

required_ruby_version: !ruby/object:Gem::Requirement 
  requirements: 
  - - ">=" 
    - !ruby/object:Gem::Version 
      version: 1.8.2
  version: 

The third line is currently interpreted as operator(- )string(- ">="), which is wrong. It should be operator(- )operator(- )string(">=").

From Redmine: http://odd-eyed-code.org/issues/237


The yaml.multiline example is broken

The YAML scanner doesn't seem to tokenize multiline strings correctly.

From Redmine: http://odd-eyed-code.org/issues/238


YAML scanner doesn't recognize arrays

...like [1, 2, 3].

From Redmine: http://odd-eyed-code.org/issues/239

@korny
Copy link
Member

korny commented Mar 10, 2013

Not sure how to approach this ticket. It seems to me that the YAML scanner needs to be rewritten. Questions:

  • What are the most problematic shortcomings of the current YAML scanner?
  • Is the Pygments one any better?
  • What's the best approach to scanning such an over-complicated "natural" format? Heuristics which lots of special cases or a more simple/rigid scanner that's optimized for the typical style people use?

@nathany
Copy link
Contributor Author

nathany commented Mar 11, 2013

I wonder if it would make sense to base this on a subset, such as SafeYAML, treating anything else like an error?

@korny
Copy link
Member

korny commented Mar 12, 2013

I don't think SafeYAML is any easier to parse. The problem is the indentation-based syntax.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants