Skip to content

Conversation

ruflin
Copy link
Contributor

@ruflin ruflin commented Dec 21, 2023

This is an early PR to discuss some parts further.

This is an early PR to discuss some parts further.
@ruflin ruflin self-assigned this Dec 21, 2023
type: keyword
- name: http.response.body.bytes
type: long
- name: http.response.status_code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aspacca If I use keyword as the type, enum seems to work. But if I use long, it doesn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin yes, that's the case, as documented at https://github.com/elastic/elastic-integration-corpus-generator-tool/blob/main/docs/fields-configuration.md

enum optional (keyword type only): list of strings to randomly chose from a value to set for the field (any cardinality will be applied limited to the size of the enum values)

The reason because you have enum only applicable to keyword is that you can have an enum with a list of numeric values, and at the same time range and fuzziness cannot apply at all to keyword. At the contrary an hypothetical enum for long should be validated for not having range and fuzziness applied as well, since having an enum excludes the other two.
It just seemed logic to me :)

Since I see this as a matter of expectations and UX I'll leave to you if a more detailed and clear documentation is enough, or we should rather extend enum applicability: please open an issue with the desired solution :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As status_code is a long in ECS, it fields natural to define it as an int: https://www.elastic.co/guide/en/ecs/current/ecs-http.html#field-http-response-status-code Is my understanding correct, supporting other types in enum is not the problem itself, it is that if we do, more validation should be added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding correct, supporting other types in enum is not the problem itself, it is that if we do, more validation should be added?

yes, correct!

@@ -0,0 +1,36 @@
{{- $timestamp := generate "timestamp" }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aspacca Looking at the existing benchmarks, the first few lines are always about generating the variables. There are sometimes additions to it, but first variables have to be generated. As we know all the variables that have to be generated, I wonder if we could do this already for the user outside the template. Inside the template, the dev can directly Work with all the variables by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin yes, I see your point

The only "problem" I see is that we should move to a predefined naming convention for the variable names, and the outcome could be less favorable than we think.

Given a field name, the name of the variable we generate must be at the same time predictable and valid in the context of the go text template. I exclude we can achieve both without dealing with names collision, and I can see only two ways of solving this collision: 1. imposing a limit in naming the fields, 2. use some arbitrary convention to discriminate colliding names.

Let me clarify with a fields.yml example:

- name: field.a
  type: long
- name: field_A
  type: keyword

we have only two options for the above:

  1. prevent the possibility to have something like "field.a" and "field_A" as field names
  2. leave that possibility and generate something like "$field_a_1" and "$field_a_2" as variables

If we decide to add the feature you propose I strongly advocate for 1, and precisely I would impose the constraint that a field name must be a valid variable name in go text template as well.

cc @tommyers-elastic , @gizas , @aliabbas-elastic , as currently involved in writing/reviewing fields.yml file and templates, what's your opinion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed the important detail around the naming and that auto converting names could create collisions. If we go down this path, so far the convention seems to be if we have field.a this gets converted to fieldA and field_A would stay field_A. Still, fieldA would create a conflict.

One interesting bit is that most fields are ECS fields that we use. What if we auto generate ECS fields (lookup talbe) but all the other fields would have to be generated? Like this, everyone could easily use the ECS fields and know there is the hostName convention for these fields?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One interesting bit is that most fields are ECS fields that we use. What if we auto generate ECS fields (lookup talbe) but all the other fields would have to be generated? Like this, everyone could easily use the ECS fields and know there is the hostName convention for these fields?

Yes, I think this is a good trade-off.
The problem is not having a naming convention, but supporting potential collision in the naming convention.
Assuming ECS fields have no collision we get rid of the problem.

"type": "logs",
"dataset": "nginx.access"
},
"message": "{{$ip}} - - [{{$timestamp | date "02/Jan/2006:15:04:05"}} {{$timezoneOffset}}] \"{{$httpRequestMethod}} /favicon.ico HTTP/1.1\" {{$httpResponseStatusCode}} {{$httpResponseBodyBytes}} \"http://localhost:8080/\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.59 Safari/537.36\"",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aspacca It would be nice to have 2 helpers:

  • path: Helper that generates path that are half realistic and cardinatlity could be passed in
  • hostnames (+ port): Helper that generates hostnames optional with port numbers, agian with cardinality passed in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin let me spend some time to think about a possible workaround, but I'd say that's impossible to have helpers with cardinality that could be passed in, for the reason that template rendering is stateless.

the alternative is to express the same with an entry config rather than an helper to be used in the template, more or less what I've mentioned here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I'd say that's impossible to have helpers with cardinality that could be passed in, for the reason that template rendering is stateless.

To have cardinality, it would have to be a type in the fields.yml? Other idea: What if the path function would have a predictable generation of paths, for example internally it would have 1000 paths pregenerated and would then pick from these?

On the comment you referenced, can you add more details there, didn't fully get it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin

I might got it wrong: what do you exactly mean when you say that cardinality could be passed in to the helpers?

If it's just a matter that the helpers' output should respect a cardinality, that's easily solved having a field with cardinality as argument of the helper.

something like in config.yml:

fields:
  - name: pathHelperCardinality
    cardinality: 70
  - name: hostIPHelperCardinality
    cardinality: 30

and then in the template something like:

{{ generate "pathHelperCardinality" | path }}
{{ generate "hostIPHelperCardinality" | hostIP }}

the solution proposed in the comment I referenced is having a configuration entry to define the formatting pattern of the value to generate.

something like in config.yml:

fields:
  - name: path
    cardinality: 70
    formatting_pattern: /string/string/string
  - name: hostIP
    cardinality: 30
    formatting_pattern: ip:port

The main matter would be to define the syntax of the pattern. A good candidate could be supporting grok expressions, what do you think?

both solutions have pros and cons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For path, I'm leaning towards {{ generate "pathHelperCardinality" | path }}. The reason is that I would argue, in most log messages the exact path does not matter, it can be a generic pattern. It simplifies configuration as the user does not have to specify a pattern.

But for the hostIP, I like the idea of being able to specify if ip or ip:port should be used. We have already some examples around this in the nginx logs where both scenarios exists.

Supporting both solutions sounds overkill so I'm torn. I'm jumping forth and back between "it is nice to be able to configure things directly in the template" and "it is good to have all configs in one place and template should be as simple as possible".

@aspacca What is your take?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting both solutions sounds overkill so I'm torn. I'm jumping forth and back between "it is nice to be able to configure things directly in the template" and "it is good to have all configs in one place and template should be as simple as possible".

@aspacca What is your take?

I'm more on the "it is good to have all configs in one place and template should be as simple as possible", it's indeed the same reason for generating variables outside the template.

After considering all the tradeoffs, I've no doubt that introducing a formatting configuration, vs just implementing the two helpers you asked for, is the way to go.
The effort required for the immediate goal of being able to print a path and an hostname is more or less the same for both solutions.
Strategically helpers are not sustainable: just because every new one is blocker that must be implemented in the code.
Last but not the least: formatting configuration do not need to deal with cardinality, contrary to helpers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets go with formatting configuration then.

@@ -0,0 +1,28 @@
fields:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aspacca Writing some benchmarks, made me think again about the role of config.yml and fields.yml. If I remember correctly, both exists because there are different types of schemas that can be generated but also, because multiple configs for the same fields could exist, so definition has to be done only once.

I'm focusing on the second use case for a moment. Lets say you want to specify 3 different benchmark configs. Likely the same template would be reused. One of the 3 configs would be the base that contains all the definition. Config 2 and 3 and would be an extension of the base config, meaning some the values would be overwritten with different values. Like this, only values that are changed would have to be put in, the type of the field should never change (that is at least my assumption).

What do you think about the following logic:

  • config.yml also allows to define the type
  • If a fields.yml is around, it is automatically merged with the specified config.yml
  • A config.yml allows on the top to specify another config.yml it inherits from and gets also merged with it.

This means, all the existing benchmarks keep working. New benchmarks can only use the config.yml (if they want to) and eventually multiple benchmarks are possible in an easy way. The inheritance I would not put any focus on at the moment, more a nice to have.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin

What do you think about the following logic:

  • config.yml also allows to define the type
  • If a fields.yml is around, it is automatically merged with the specified config.yml

I honestly find it a little confusing. I see the point is to be able to have only a config.yml in a backwards compatible way.

I'm quite convinced that having well defined different and separated content for fields definition and configuration for values generation is more transparent, easier to document and overall lowers the number of special cases and/or exceptions to deal with in the code, but most import for the person that has to write the yaml files :)

for example. "config.yml also allows to define the type": problem is, if fields.yml is not present the type in config.yml is mandatory, if fields.yml is present the type in config.yml is ignored/validation fails (because I agree that the type of the field should never change)

Let me quote myself :)

well defined different and separated content

so, this can be also a single yaml file with multiple sections.
config.yml was already refactored for this, you have the fields section in it. it was introduced with the idea to add some kind of processor section where defining stuff like removing new lines etc.
we can introduce a types section that will be used in the case a fields.yml is not present (still not a big fan of ending up ignoring the section content/failing validation when it is), what do you think about that?

all the above is motived also (but not only) by the fact that integrations benchmarks from elastic-package is only one of the usages of the corpus generator. i'm obviously biased but I'd prefer to look for solutions that won't compromise that :)

finally: I'm not sure I understand what's the benefit of being able to use only config.yml or how big the benefit is :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bias is of course on the data generation of elastic-package and making it as simple as possible. I get that we are trying to find a balance here between two sometimes contradicting forces.

Lets go back to the problem I'm trying to solve. When I build an benchmark, I keep jumping forth and back between fields.yml and config.yml and each time I ask myself why I have to put the names in two places. I understand the logic but still each time I come back I look at existing files on what to put where. It could be that it is just me or something is not intuitive here. Easy solution is just to blame myself and move on. It would be good to get some input from more folks working on the templates @tommyers-elastic .

I like that as a first step we could move to single file, this would already simplify the jumping forth and back.

Copy link
Contributor

@aspacca aspacca Jan 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that as a first step we could move to single file, this would already simplify the jumping forth and back.

@ruflin I think this is a good tradeoff and has enough isolated value on its own

so let's add a types section to config.yml: you can only have either it or a fields.yml, a validation error will occur otherwise, in order to reduce ambiguity.

btw, this requires changing package-spec in order to make fields optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

path: ./access-benchmark/template.ndjson
config:
path: ./access-benchmark/config.yml
fields:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below around the fields.yml. This would become optional.

Are there cases where multiple fields.yml could exist? If not, could we just make it a convention an no config needed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cases where multiple fields.yml could exist?

no, there should not be any case like that.
we want to have multiple config.yml but not multiple templates. multiple templates could have a different set of fields that will require multiple fields.yml. the number of config.yml alone, with a single template, does not have any impact on this.

If not, could we just make it a convention an no config needed?

fields is still required (as well as config and template) because you can either reference a path with the content or define it raw inline (see https://github.com/elastic/package-spec/blob/main/spec/integration/_dev/benchmark/system.scenario.spec.yml#L77-L118)

If you reference a path, the name is hardcoded: https://github.com/elastic/package-spec/blob/main/spec/integration/_dev/benchmark/spec.yml#L32-L56

happy to drop support for raw and make the content of the benchmark folder mandatory, so that we can drop as well fields, config and template :)

but I'd suggest to proceed this way only if systems benchmark will align too: @marc-gr is it fine for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a sample on where raw is used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a sample on where raw is used?

not for rally benchmarks, and even if we have for system ones, we could just move the content to a file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to drop support for raw and make the content of the benchmark folder mandatory, so that we can drop as well fields, config and template :)

By dropping you mean not supporting it anymore or just have defaults in place?

@marc-gr Any chance to get your 👍 👎 on this?

@elasticmachine
Copy link

elasticmachine commented Dec 21, 2023

🚀 Benchmarks report

Package nginx 👍(2) 💚(0) 💔(0)

Expand to view
Data stream Previous EPS New EPS Diff (%) Result
access 0 3154.57 3154.57 ( - %) 👍
error 0 16949.15 16949.15 ( - %) 👍

@ruflin
Copy link
Contributor Author

ruflin commented Dec 21, 2023

/test benchmark fullreport

@botelastic
Copy link

botelastic bot commented Jan 20, 2024

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jan 20, 2024
@ruflin
Copy link
Contributor Author

ruflin commented Jan 22, 2024

@aspacca Would be great to get some feedback on the above comments

@botelastic botelastic bot removed the Stalled label Jan 22, 2024
@botelastic
Copy link

botelastic bot commented Feb 24, 2024

Hi! We just realized that we haven't looked into this PR in a while. We're sorry! We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Feb 24, 2024
@ruflin
Copy link
Contributor Author

ruflin commented Mar 4, 2024

I'm closing this PR as I don't expect the changes to be merged but a follow up is needed on the various discussions above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants