Monday, April 22, 2019

windows 8.1 - How can I fix broken shift-JIS filenames?


I've got some files whose Shift-JIS filenames have been decoded as ANSI, e.g.


home_03@Â‚¢ƒgƒ‰ƒ“ƒNŠJ‚¢‚½Aƒtƒ@ƒCƒ‹—L‚è

when they should be in Shift-JIS, like


home_03@青いトランク開いた、ファイル有り

This is because the archive extractor I'm using doesn't support Shift-JIS. That can't really be helped, but is there a way to fix the filenames of the files I've already extracted?


edit:


another example


Ší‹ï‘ä@ƒXƒpƒi

should be


器具台@スパナ

Answer



Since you're using Windows, PowerShell is probably the easiest method.


Now, PowerShell internally uses UTF-16 for its strings, so a conversion would involve four steps:



  1. Read the incorrect filename from the filesystem into PS (represented internally as a UTF-16 string)

  2. Tell PS to convert the string into a raw byte array as if the string were in the source encoding. We can't use the PS string directly (as it's UTF-16).

  3. Tell PS to convert the byte array back to a string, interpreting it as the destination encoding. This will give us a UTF-16 string of the raw bytes interpreted as Shift-JIS.

  4. Rename the file


Let's start by defining the encodings. In your case, I'm guessing your source is Windows-1252 (default non-Unicode codepage for Western/English Windows).


$srcEnc = [System.Text.Encoding]::GetEncoding("Windows-1252")
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")

You could also use [System.Text.Encoding]::Default to get the current system codepage but I prefer to be explicit.
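
To see what [System.Text.Encoding]::Default resolves to on a given machine before relying on it, a quick check along these lines may help. This is only a sketch: the variable name $default is made up for illustration, and on PowerShell 6 and later (which run on .NET Core) Default reports UTF-8 rather than the system's ANSI code page, so the result is mainly meaningful in Windows PowerShell.

```powershell
# Inspect the encoding that [System.Text.Encoding]::Default resolves to.
$default = [System.Text.Encoding]::Default
"{0} (code page {1})" -f $default.WebName, $default.CodePage
```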


Then we apply the conversion steps:


$newName = $destEnc.GetString($srcEnc.GetBytes($oldName))
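
As a sanity check, the conversion can be tried on the second example from the question without touching any files. The sketch below simulates the broken extraction and then reverses it. The variables $expected, $broken and $fixed are made up for illustration, and the name is built from code points so that the result does not depend on how the script file itself is saved.

```powershell
$srcEnc  = [System.Text.Encoding]::GetEncoding("Windows-1252")
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")

# The intended name from the second example, built from its Unicode code points.
$expected = -join ([char[]](0x5668, 0x5177, 0x53F0, 0x40, 0x30B9, 0x30D1, 0x30CA))

# Simulate the broken extraction: Shift-JIS bytes misread as Windows-1252.
$broken = $srcEnc.GetString($destEnc.GetBytes($expected))

# Apply the repair: back to raw bytes, then decode as Shift-JIS.
$fixed = $destEnc.GetString($srcEnc.GetBytes($broken))

# Case-sensitive comparison; True means the round trip restored the name.
$fixed -ceq $expected
```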

In your example, home_03@Â‚¢ƒgƒ‰ƒ“ƒNŠJ‚¢‚½Aƒtƒ@ƒCƒ‹—L‚è becomes home_03@ツいトランク開いたAファイル有り. While this differs from your expected result (see notes at bottom), it matches what I get from http://string-functions.com/encodedecode.aspx's Windows-1252 => Shift-JIS conversion. If this is incorrect, you may have to experiment until you find the correct source and destination encodings.


Putting it together with a standard loop:


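# Source: how the names were wrongly decoded. Destination: how they should be decoded.
$srcEnc = [System.Text.Encoding]::GetEncoding("Windows-1252")
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")
# Re-encode each name to raw bytes, decode the bytes as Shift-JIS, then rename.
Get-ChildItem | ForEach-Object { Rename-Item -LiteralPath $_.FullName -NewName ($destEnc.GetString($srcEnc.GetBytes($_.Name))) }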

Or if you prefer to recurse into subdirectories:


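# Same conversion, applied to everything under the current directory.
$srcEnc = [System.Text.Encoding]::GetEncoding("Windows-1252")
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")
# Use FullName so that items in subdirectories resolve to the right path.
Get-ChildItem -Recurse | ForEach-Object { Rename-Item -LiteralPath $_.FullName -NewName ($destEnc.GetString($srcEnc.GetBytes($_.Name))) }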

Add -File to Get-ChildItem if you want to avoid renaming directories.
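
Before renaming anything for real, it may be worth previewing the result. The following is a minimal sketch rather than part of the commands above: Get-RepairedName is a hypothetical helper name, names that would not change are skipped, and -WhatIf makes Rename-Item only report what it would do.

```powershell
$srcEnc  = [System.Text.Encoding]::GetEncoding("Windows-1252")
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")

# Hypothetical helper: re-encode a name to bytes, then decode them as Shift-JIS.
function Get-RepairedName([string]$Name) {
    $destEnc.GetString($srcEnc.GetBytes($Name))
}

Get-ChildItem -File | ForEach-Object {
    $newName = Get-RepairedName $_.Name
    # Plain ASCII names come out unchanged, so there is nothing to rename.
    if ($newName -cne $_.Name) {
        # -WhatIf prints the planned rename without performing it.
        Rename-Item -LiteralPath $_.FullName -NewName $newName -WhatIf
    }
}
```

Once the preview looks right, removing -WhatIf performs the renames.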




Looks like your example included two characters that were invalid in Windows-1252 and were likely dropped when you posted the question (based on reversing the process using your example output). There's a 144 (0x90) between the first @ and Â, and a 129 (0x81) between the ½ and A. For the convenience of anyone else looking to test, here's a base64-encoded version of the raw bytes: aG9tZV8wM0CQwoKig2eDiYOTg06KSoKigr2BQYN0g0CDQ4OLl0yC6A==.
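
To reproduce this, the base64 string can be decoded back into bytes and interpreted as Shift-JIS. A short sketch, with $bytes and $name as illustrative variable names:

```powershell
$destEnc = [System.Text.Encoding]::GetEncoding("Shift-JIS")

# Raw bytes of the original filename, taken from the base64 string above.
$bytes = [System.Convert]::FromBase64String("aG9tZV8wM0CQwoKig2eDiYOTg06KSoKigr2BQYN0g0CDQ4OLl0yC6A==")

# The two bytes with no Windows-1252 character sit at offsets 8 and 26.
"{0:X2} {1:X2}" -f $bytes[8], $bytes[26]   # 90 81

# Decode the bytes as Shift-JIS to get the intended name.
$name = $destEnc.GetString($bytes)
$name
```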




Also note that this will not work when there are characters Windows considers invalid in either your source or destination filenames. Especially in the source filename, as your extraction tool probably would have irrecoverably mangled the name on extraction (by dropping the bytes corresponding to the invalid characters like ? or \ in the wrong encoding). The only thing you can do in those cases is use an alternative extraction tool that avoids this problem entirely.

