Weird issue with Truncated SVD and NumericStringConvertor #226

Open
@MihailoJoksimovic

So it took me ages to figure out the WHY, but I finally pinpointed some extremely weird behavior.

Namely, here's the simplest code that reproduces the issue:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa'])->apply(new \Rubix\ML\Transformers\NumericStringConverter());

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

Output

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
    [1]=>
    array(2) {
      [0]=>
      float(0)
      [1]=>
      float(0)
    }
  }
}

As you can see, it's all zeros.

Now, removing the NumericStringConverter:

$dataset = \Rubix\ML\Datasets\Labeled::build([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3, 1.4, 0.2]
], ['setosa', 'variosa']);

$transformer = new \Rubix\ML\Transformers\TruncatedSVD(2);

$dataset->apply($transformer);

var_dump($dataset);

gives the following output:

object(Rubix\ML\Datasets\Labeled)#2 (2) {
  ["labels":protected]=>
  array(2) {
    [0]=>
    string(6) "setosa"
    [1]=>
    string(7) "variosa"
  }
  ["samples":protected]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      float(-6.3431263560806)
      [1]=>
      float(-0.1573150685585)
    }
    [1]=>
    array(2) {
      [0]=>
      float(-5.9145190147327)
      [1]=>
      float(0.16871521675666)
    }
  }
}

It took me hours to figure out what was happening, because nothing looks out of the ordinary. BUT I narrowed the issue down to the following lines in NumericStringConverter:

    protected function convertToNumber(array &$sample) : void
    {
        foreach ($sample as &$value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $value = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

This foreach loop, which iterates over $value by reference, is the culprit! After replacing it with:

        foreach ($sample as $key => $value) {
            if (is_string($value)) {
                if (is_numeric($value)) {
                    $sample[$key] = (int) $value == $value
                        ? (int) $value
                        : (float) $value;

                    continue;
                }

everything works as expected!
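
For anyone comparing the two loop shapes outside of Rubix ML, here is a minimal standalone sketch. The function names `convertByReference` and `convertByKey` are hypothetical and are not part of the library; they only mirror the two foreach variants quoted above. It suggests that, as far as plain PHP code can tell, both variants produce identical arrays, so the difference reported here would have to come from how the values are stored internally rather than from the converted values themselves. That last point is an assumption, not something this sketch verifies.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the by-reference loop quoted above.
function convertByReference(array &$sample): void
{
    foreach ($sample as &$value) {
        if (is_string($value) && is_numeric($value)) {
            $value = (int) $value == $value ? (int) $value : (float) $value;
        }
    }

    // Break the reference that still points at the last element.
    unset($value);
}

// Hypothetical stand-in for the key-based replacement loop.
function convertByKey(array &$sample): void
{
    foreach ($sample as $key => $value) {
        if (is_string($value) && is_numeric($value)) {
            $sample[$key] = (int) $value == $value ? (int) $value : (float) $value;
        }
    }
}

$a = ['5.1', '3', 1.4, 'setosa'];
$b = $a;

convertByReference($a);
convertByKey($b);

// Same keys, same values, same types in both arrays.
var_dump($a === $b); // prints bool(true)
```

The `unset($value)` call after the by-reference loop is the usual PHP precaution for such loops; whether it has any effect on the behavior described in this issue has not been checked here.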

This leads me to conclude that, for whatever WEIRD reason, something happens internally that messes up the SVD process. The problem is that SVD is written as a C extension, and I honestly have no clue how to debug that :)

My question is: do you see this as a bug in NumericStringConverter or in the C extension? If it's the former, I'd be happy to submit a bugfix!

andrewdalpino (Member) commented on May 15, 2022

Hey @MihailoJoksimovic, yeah, I've run into this problem before with SVD. Unfortunately, I have not had the time to debug the issue. Maybe create an issue in the Tensor repo and see if someone can fix it. Thanks!

Labels: bug, help wanted
