Skip to content

PowerShellForGitHub.log File Character Encoding Differs Between PowerShell 5 and 7 #237

Open
@X-Guardian

Description

@X-Guardian
Contributor

Issue Details

The PowerShellForGitHub.log is written with different character encoding depending upon whether PowerShell 5 or PowerShell 7 are used. PowerShell 5 encoding is UTF-16 LE and PowerShell 7 is UTF-8. This causes an issue if the PowerShellForGithub module is used on both PowerShell 5 and PowerShell 7 on the same computer, and corrupted entries in the log will occur.

Steps to reproduce the issue

$PowerShellForGitHubLogPath = "$([Environment]::GetFolderPath('MyDocuments'))\PowerShellForGitHub.log" 
Remove-Item -Path $PowerShellForGitHubLogPath
pwsh.exe -Command { Get-GitHubRepository -OwnerName 'Octocat' | Out-Null }
powershell.exe -Command { Get-GitHubRepository -OwnerName 'Octocat' | Out-Null }
Get-Content -Path $PowerShellForGitHubLogPath 

Review

Verbose logs showing the problem

2020-06-14 12:18:45 : Simon : VERBOSE : [0.14.0] Executing: Get-GitHubRepository -OwnerName "Octocat"
2020-06-14 12:18:45 : Simon : VERBOSE : Restoring Access Token from file.  This value can be cleared in the future by calling Clear-GitHubAuthentication.
2020-06-14 12:18:45 : Simon : VERBOSE : Getting repos for Octocat
2020-06-14 12:18:45 : Simon : VERBOSE : Accessing [Get] https://github.com/api/users/Octocat/repos? [Timeout = 0)]
2020-06-14 12:18:46 : Simon : VERBOSE : Telemetry has been disabled via configuration. Skipping reporting event [Get-GitHubRepository].
2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   [ 0 . 1 4 . 0 ]   E x e c u t i n g :   G e t - G i t H u b R e p o s i t o r y   - O w n e r N a m e   " O c t o c a t "

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   R e s t o r i n g   A c c e s s   T o k e n   f r o m   f i l e .     T h i s   v a l u e   c a n   b e   c l e a r e d   i n   t h e   f u t u r e   b y   c a l l i n g   C l e a r - G i t H u b A u 
t h e n t i c a t i o n .

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   G e t t i n g   r e p o s   f o r   O c t o c a t

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 8   :   S i m o n   :   V E R B O S E   :   A c c e s s i n g   [ G e t ]   h t t p s : / / a p i . g i t h u b . c o m / u s e r s / O c t o c a t / r e p o s ?   [ T i m e o u t   =   0 ) ]

 2 0 2 0 - 0 6 - 1 4   1 2 : 1 8 : 4 9   :   S i m o n   :   V E R B O S E   :   T e l e m e t r y   h a s   b e e n   d i s a b l e d   v i a   c o n f i g u r a t i o n .   S k i p p i n g   r e p o r t i n g   e v e n t   [ G e t - G i t H u b R e p o s i t o r y ] .

Suggested solution to the issue

Add -Encoding ascii to the Out-File cmdlet call in the Write-Log function of the Helpers.ps1 module. This will force the log output to be UTF-8 without BOM on all PowerShell versions.

Note, on PowerShell 5 and below, any current log will have been written using UTF-16 LE, so this log would need to be dealt with when making this change (backed up, converted, deleted ?)

Requested Assignment

  • I'm just reporting this problem, but don't want to fix it.

Operating System

OsName               : Microsoft Windows 10 Pro
OsOperatingSystemSKU : 48
OsArchitecture       : 64-bit
WindowsVersion       : 1903
WindowsBuildLabEx    : 18362.1.amd64fre.19h1_release.190318-1202
OsLanguage           : en-GB
OsMuiLanguages       : {en-GB, en-US}

PowerShell Version

PowerShell 5
Name                           Value
----                           -----
PSVersion                      5.1.18362.752
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.18362.752
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
PowerShell 7
Name                           Value
----                           -----
PSVersion                      7.0.1
PSEdition                      Core
GitCommitId                    7.0.1
OS                             Microsoft Windows 10.0.18362
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

Module Version

Running: 0.14.0
Installed: 0.14.0

Activity

added
bugThis relates to a bug in the existing module.
triage neededAn issue that needs to be reviewed by a member of the team.
on Jun 14, 2020
HowardWolosky

HowardWolosky commented on Jun 15, 2020

@HowardWolosky
Contributor

Thanks for catching this!

I think I had noticed it in passing and was wondering what was going on here, but not enough to look into it any further.

Given that Unicode characters can be used with the GitHub API, it makes more sense to use utf8 vs ansi, but otherwise I agree that it was an oversight that an explicit encoding format wasn't specified in that function.

removed
triage neededAn issue that needs to be reviewed by a member of the team.
on Jun 15, 2020
giuseppecampanelli

giuseppecampanelli commented on Jun 16, 2020

@giuseppecampanelli
Contributor

Is this up for grabs?

HowardWolosky

HowardWolosky commented on Jun 16, 2020

@HowardWolosky
Contributor

Is this up for grabs?

Go for it. To be clear, have it use utf8.

added
in progressWork on this issue is already underway. Please don't work start new work on it.
on Jun 16, 2020
HowardWolosky

HowardWolosky commented on Jun 16, 2020

@HowardWolosky
Contributor

Actually, play with it a bit. I'm still seeing oddities using utf8 in PS5. You might want to try unicode instead. Do some testing with the different encoding formats across both PS5 and PS7 to make sure that logging from commandline usage as well as from Pester all output the same without any unexpected characters.

giuseppecampanelli

giuseppecampanelli commented on Jun 16, 2020

@giuseppecampanelli
Contributor

Thanks and will do!

giuseppecampanelli

giuseppecampanelli commented on Jun 18, 2020

@giuseppecampanelli
Contributor

@HowardWolosky I've tested both Unicode and UTF8 encodings and both seem to solve the issue. The issue was caused due to the fact that Out-File in PS5 has a default encoding of Unicode and Out-File in PS7 has a default encoding of UTF8NoBOM so they were mismatched.

I couldn't notice any issues with either, what were the oddities you noticed with UTF8? I tested outputs from both executing the command directly and from Pester.

HowardWolosky

HowardWolosky commented on Jun 21, 2020

@HowardWolosky
Contributor

It looks like the issue I saw was related to what @X-Guardian mentioned in his initial post:

Note, on PowerShell 5 and below, any current log will have been written using UTF-16 LE, so this log would need to be dealt with when making this change (backed up, converted, deleted ?)

When I left my existing file as-is bit changed it to append using UTF-8, this is what I would find at the end of the file when new content got added after the change:

badEncoding

If I deleted the file entirely, the problem went away when it got re-created.

I think what we need to do:

Add a new method to Helpers.ps1. Something like Repair-LogFile.

  • It will make use of the idea from here to read the first 3 bytes of the existing log file (if it exists). If the bytes aren't @(239, 187, 191) (e.g. EF. BB. BF) then it's the wrong encoding.
  • In that scenario, we should move/backup the file and write a warning to the user to let them know that it happened.
$logPath = Get-Item -Path (Get-GitHubConfiguration -Name LogPath)
$backupFileName = $logPath.BaseName + ".backup" + $logPath.Extension
$backupPath = (Join-Path -Path $logPath.Directory -ChildPath $backupFileName)
Move-Item -Path $logPath -backupPath
Write-Log -Message "Log file encoding format has changed.  Old file has been moved to [$backupPath]." -Level Warning

Then I think a new script file needs to be added to the module (something like OnModuleLoad.ps1) and then have that added to ScriptsToProcess in the manifest. Have OnModuleLoad.ps1 call Repaid-LogFile. Then, the log check and migration will happen immediately upon being loaded.

Make sense?

X-Guardian

X-Guardian commented on Jun 21, 2020

@X-Guardian
ContributorAuthor

How about just converting the file, it is then transparent to the user? Do the same byte check then read the file and write back out with the required encoding if required.

I wouldn't use -Encoding utf8 as that add a BOM to the file. -Encoding ascii writes UTF8 without a BOM.

giuseppecampanelli

giuseppecampanelli commented on Jun 21, 2020

@giuseppecampanelli
Contributor

@X-Guardian wouldn't ASCII be an issue since the GitHub API can use Unicode characters?

HowardWolosky

HowardWolosky commented on Jun 21, 2020

@HowardWolosky
Contributor

How about just converting the file, it is then transparent to the user? Do the same byte check then read the file and write back out with the required encoding if required.

Generally speaking, I like that idea. I think a potential problem with it (albeit a one-time only problem) is if the file is substantially large (because it's never been pruned), it could take a while to read in and read out, which will silently block the user's console window with no clear explanation.

According to this article, when writing the file back out, we should use StreamWriter to achieve the fastest performance (tip of the hat to StackOverflow for that link.

Similarly, it appears that when trying to read the whole file in, you should probably use [System.IO.File]::ReadLines().

I wouldn't use -Encoding utf8 as that add a BOM to the file. -Encoding ascii writes UTF8 without a BOM.

I've just finished reading through lots of opinions on the BOM vs no BOM argument. I see the merit of avoiding the BOM, but I see no way of doing so while sticking strictly with PowerShell given the encoding options that v4/5 have (as opposed to 7 which does have a no BOM option).

Luckily, it looks like our preferred way of writing, based on that provided article, does provide a utf8 option that excludes the BOM....

$path = "c:\users\howard\desktop\test.txt"
$stream = [System.IO.StreamWriter]::new($path, [System.Text.Encoding]::UTF8)
$stream.Write('')
$stream.Close()

# This results in a 0-byte file which is also confirmed by:
Get-Content -Encoding Byte -ReadCount 3 -Path $path
# Returns nothing

# If we try again with a single character:
$stream = [System.IO.StreamWriter]::new($path, [System.Text.Encoding]::UTF8)
$stream.Write('a')
$stream.Close()
Get-Content -Encoding Byte -ReadCount 3 -Path $path

# That returns back 97, which is the letter 'a': https://www.codetable.net/decimal/97

Given this, I think the path forward is to have the Repair-LogFile method as I previously described. It should detect the first two bytes of the file. If they are @(255, 254), that means that it's the UTF-16 LE encoding from PS4/5. That's when you would read in the file with [System.IO.File]::ReadLines() and then write the whole content out again with StreamWriter.

Then you would also update Write-Log to use StreamWriter.WriteLine() to append the log file entry moving forward, as opposed to Out-File -Append. That way, we can ensure uniform encoding across all PowerShell versions moving forward.

@X-Guardian -- are you seeing issues with this plan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugThis relates to a bug in the existing module.in progressWork on this issue is already underway. Please don't work start new work on it.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Participants

      @giuseppecampanelli@HowardWolosky@X-Guardian

      Issue actions

        PowerShellForGitHub.log File Character Encoding Differs Between PowerShell 5 and 7 · Issue #237 · microsoft/PowerShellForGitHub