English Amiga Board


Old 06 November 2020, 17:30   #1
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
SMBFS and UTF-8 Characters

Hi all!

On my Amiga, I use SMBFS to mount volumes that point to shared drives on my RPI, where I keep all my MP3s and modules. It works well and, using the TRANSLATE option, I can read and write files back and forth, even with "typical" accented characters like é or Ô:

SMBFS USER=<username> PASSWORD=<password> DOMAIN=GIB SERVICE=//CHAMSAE/Music TRANSLATE=L:FileSystem_Trans/INTL.crossdos

Recently, I added MP3 files with names in Cyrillic or Hangul characters. These files are no problem for either the RPI itself or my Windows laptop because they both use UTF-8. But these files do not appear through SMBFS. I was expecting mangled file names, maybe like ????? ????.mp3, so that I could have played them anyway, but they do not appear at all.

Is there a way to "see" files with UTF-8 names using SMBFS?

Cheers!
Old 06 November 2020, 19:12   #2
nogginthenog
Amigan
 
Join Date: Feb 2012
Location: London
Posts: 1,309
Confirmed! I'm running a self-built SMBFS from Olaf's GitHub, dated 20 August.
Montréal works OK.
Old 06 November 2020, 20:18   #3
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi Nogginthenog!

Yes, Montréal also works for me... But try ????? ? - ???????.mp3 (Cyrillic): it doesn't show at all for me.

And it just so happens that EAB doesn't support UTF-8 either! What appears above as ????? ? - ???????.mp3 should be the title of this video: [YouTube video]!

Cheers!
Old 08 November 2020, 11:59   #4
Olaf Barthel
Registered User
 
Join Date: Aug 2010
Location: Germany
Posts: 532
Quote:
Originally Posted by tygre
Is there a way to "see" files with UTF-8 names using SMBFS?
No, not yet. Unless you are forced to use an old Windows system or old Samba version ("old" meaning 20 years or older) the names of files and folders will be represented using 16 bit Unicode characters.

This is an option which smbfs (version 1.176 and beyond) can make good use of for the small portion of Unicode which maps exactly to the Amiga default character set (this being ISO 8859-1).

If this is all you need, then you don't even have to resort to the file name translation tables which use the original MS-DOS codepage-based scheme.

So, what about the remaining (roughly) 65,280 Unicode characters?

Because the Amiga cannot display these characters, smbfs will not attempt to return them during directory scanning and will not allow you to access them. The problem here is that these characters have no sound representation in the Amiga domain. Mapping them to UTF-8 sequences is tricky (what if you want to rename a file or directory?). Also, if you switch to UTF-8 then you would have to encode all characters except for those present in the US-ASCII 7 bit character set.
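To make the cut-off concrete, here is a minimal sketch in C of a conversion that accepts only the part of 16 bit Unicode that maps exactly to ISO 8859-1 (code points 0x00 to 0xFF) and refuses everything else. The function name and its error convention are assumptions made for illustration; this is not the code smbfs actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from smbfs: convert a UTF-16LE name of
 * 'len' bytes into ISO 8859-1. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if the input is malformed, the
 * output buffer is too small, or a character lies outside ISO 8859-1. */
static int utf16le_name_to_latin1(unsigned char *to, size_t to_size,
                                  const unsigned char *from, size_t len)
{
    size_t i, n = 0;

    assert(to != NULL && from != NULL);

    if (to_size == 0 || len % 2 != 0)
        return -1; /* need room for the NUL; code units are two bytes each */

    for (i = 0; i < len; i += 2) {
        /* little-endian: low byte first, then high byte */
        unsigned int c = (unsigned int)from[i] | ((unsigned int)from[i + 1] << 8);

        if (c > 0xFF)
            return -1; /* e.g. Cyrillic or Hangul: no ISO 8859-1 equivalent */
        if (n + 1 >= to_size)
            return -1; /* no room left for this character plus the NUL */
        to[n++] = (unsigned char)c;
    }
    to[n] = '\0';
    return (int)n;
}
```

With a converter like this, a directory scanner that skips every entry for which the function returns -1 would show "Montréal" but silently drop a Cyrillic name, which matches the behaviour described in this thread.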

I have been pondering how to work around that problem since January this year and didn't make much progress beyond that, I'm afraid.

The problem is solvable to some degree, but you'd still see the characters which don't fit the Amiga domain in an encoded form.

This will have its own drawbacks since you'd easily have to double or triple the length of the respective file or directory names. The upper limit for these names is 107 characters, and smbfs won't let you use names longer than that (they will not show up in directory lists and remain inaccessible).
Old 08 November 2020, 16:54   #5
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi Olaf and thank you very much for your answer!

I see that that's a tricky problem... What about something like the MS-DOS/Windows "scheme" of converting long file names into 8+3 chars? Let me explain.

If I understand correctly, SMBFS knows that a file name contains chars beyond ISO-8859-1 and, wisely, decides to hide it. Instead, could it convert the file name into a unique encoding of its own? For our beloved Amigas, SMBFS could keep any ISO-8859-1 chars, up to 100 chars, and then append a unique number?

The unique number would distinguish two files that could have the same ISO-8859-1 chars by chance. (It could also be used internally by SMBFS to keep a correspondence between files "shown" on the Amiga side and the original files on the RPI, although something more robust could be necessary.)

As with MS-DOS/Windows, this "scheme" would prevent renaming from the Amiga side but would allow accessing (yeah!) and even moving.

Would that be possible? Could that create other problems?
Cheers!

Last edited by tygre; 08 November 2020 at 16:56. Reason: Fixed proposed naming scheme
Old 09 November 2020, 14:00   #6
Olaf Barthel
Registered User
 
Join Date: Aug 2010
Location: Germany
Posts: 532
Quote:
Originally Posted by tygre
Hi Olaf and thank you very much for your answer!

I see that that's a tricky problem... What about something like the MS-DOS/Windows "scheme": converting long file names into 8+3 chars? Let me explain
This is a well-known technique, but if I remember correctly, once you request that file names should be returned in Unicode form, you lose access to the 8.3 format which the SMB server helpfully provides along with the legacy directory records.

Quote:
If I understand correctly, SMBFS knows that a file name contains chars beyond ISO-8859-1 and, wisely, decides to hide it. Instead, could it convert the file name into a unique encoding of its own? For our beloved Amigas, SMBFS could keep any ISO-8859-1 chars, up to 100 chars, and then append a unique number?
No, this is not a workable approach. Files and folders are accessed through their complete path names, and if smbfs were to translate the path name components through a checksum scheme of its own making, then it would have to cache that information almost indefinitely. Also, it would have to cache how each checksummed path component relates to its parent.

The alternative is to rescan every directory that may contain encoded file or folder names upon access: the Samba server does that in order to allow for the 8.3 encoding to work but a client such as smbfs does not have this luxury. smbfs might have to ask the server over and over again for every directory that is part of a path.

Quote:
The unique number would distinguish two files that could have the same ISO-8859-1 chars by chance. (It could also be used internally by SMBFS to keep a correspondence between files "shown" on the Amiga side and the original files on the RPI, although something more robust could be necessary.)

As with MS-DOS/Windows, this "scheme" would prevent renaming from the Amiga side but would allow accessing (yeah!) and even moving.

Would that be possible? Could that create other problems?
Cheers!
The only scheme which would work for smbfs is one which requires no caching of the replacement name (even listing the contents of a directory would require adding a replacement for a Unicode file name to the cache, in case it's needed later).

So this would have to be a 1:1 mapping, I'm afraid.

This could work, but it would have to use an "escape character" (or a sequence of characters) which indicates that what follows is Unicode data. I think this could work if the encoded Unicode data could be stored in a compact form. UTF-8 showed how this could be done. In order to keep the ISO 8859-1 Amiga file/drawer names, I could use an "escape sequence" of two characters, for example.

The drawback still is that the file/drawer name length is limited to 107 characters, and with each Unicode character becoming "escape sequence"+2 or 3 encoded characters this may quickly exhaust the available space. And that's not even considering how full path names will work out. Many applications don't allow path names longer than 100-300 characters (and there are, of course, those which don't even check if the full path name fits into the buffer).
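The space problem is easy to see with a small sketch in C. The encoding below is purely hypothetical and deliberately not compact: it copies ISO 8859-1 characters as they are, doubles a literal '%', and writes every other 16 bit character as the two-character escape "%u" followed by four hex digits. The function name, the choice of '%' and the constant AMIGA_NAME_MAX are assumptions for illustration only; the more compact form mentioned above would use fewer characters per code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define AMIGA_NAME_MAX 107 /* name length limit quoted in this thread */

/* Hypothetical 1:1 encoder: returns the encoded length, or -1 if the
 * result would exceed AMIGA_NAME_MAX characters or the output buffer. */
static int encode_name(char *to, size_t to_size,
                       const uint16_t *from, size_t count)
{
    size_t i, n = 0;

    assert(to != NULL && from != NULL);

    for (i = 0; i < count; i++) {
        uint16_t c = from[i];
        /* six characters for an escaped code unit, two for "%%", else one */
        size_t need = (c > 0xFF) ? 6 : (c == '%') ? 2 : 1;

        if (n + need > AMIGA_NAME_MAX || n + need >= to_size)
            return -1;

        if (c > 0xFF) {
            /* writes "%uXXXX" plus a NUL that later output overwrites */
            snprintf(to + n, 7, "%%u%04X", (unsigned int)c);
        } else if (c == '%') {
            to[n] = '%';
            to[n + 1] = '%';
        } else {
            to[n] = (char)c;
        }
        n += need;
    }
    if (n >= to_size)
        return -1;
    to[n] = '\0';
    return (int)n;
}
```

Under this particular scheme a name made only of escaped characters can hold at most 17 of them (17 x 6 = 102 characters); an 18th would need 108 characters and no longer fits into the 107 available.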

This remains a thorny problem. The question is which trade-offs are acceptable: for example, how much memory smbfs may commit to lookup tables, or how often it may rescan directories.
Old 09 November 2020, 23:43   #7
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi Olaf!

Quote:
The only scheme which would work for smbfs is one which requires no caching
I haven't read the SMBFS source code, so I'm certainly missing something here, sorry, but could you explain why caching would be a problem for SMBFS? (Except for memory consumption.)

Quote:
but a client such as smbfs does not have this luxury. smbfs might have to ask the server over and over again for every directory that is part of a path.
Would that be so bad? (Except for speed and network traffic.)

Quote:
once you request that file names should be returned in Unicode form, you lose access to the 8.3 format
Could you please explain to me (or point me to the place in the source code) why the 8+3 format couldn't be made "compatible" with Unicode? My reasoning is: if SMB/SMBFS can match, for example:

TextFile.Mine.txt <-> TEXTFI~1.TXT
ver +1.2.text <-> VER_12~1.TEX
.bashrc.swp <-> BASHRC~1.SWP


Then, couldn't it also match:

VilleDeMontréal.txt     <-> VilleDeMontréal.txt
XXXX - XXXXXX XXXXX.mp3 <-> ???? - ?????? ?????~1.mp3
XXX - XX.mp3 <-> ??? - ??~2.mp3


Where X is some Unicode code point and ? is just some ISO-8859-1 character chosen to replace any Unicode code point outside ISO-8859-1. The ~ could also be different, maybe \ to show that these names are for the Amiga's "consumption" only... Ironically, this is similar to what EAB does: when I copied and pasted file names with Cyrillic and Hangeul characters and saved my post, EAB replaced every code point (Cyrillic characters, Hangeul syllables) with "?". Wouldn't that work?
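A rough sketch in C of the naming rule proposed in this post, assuming the name is available as an array of 16 bit code units: characters above 0xFF become '?', and a "~<serial>" tag is inserted before the last dot. The function make_alias and its parameters are hypothetical and not part of SMBFS. Because the result cannot be converted back into the original name, whoever calls it would still have to remember which alias belongs to which original.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of smbfs): build a Latin-1 display name
 * from 'count' UTF-16 code units. Characters above 0xFF become '?', and
 * if any were replaced, "~<serial>" is inserted before the last dot (or
 * appended when there is no dot). Returns 1 if an alias was produced,
 * 0 if the name was copied unchanged, -1 if 'to' is too small. */
static int make_alias(char *to, size_t to_size,
                      const uint16_t *from, size_t count, unsigned int serial)
{
    char tag[16] = "";
    size_t i, n = 0, tag_len = 0, dot = count; /* dot == count: no extension */
    int replaced = 0;

    assert(to != NULL && from != NULL);

    for (i = 0; i < count; i++) {
        if (from[i] == '.')
            dot = i;        /* remember the position of the last dot */
        if (from[i] > 0xFF)
            replaced = 1;   /* at least one character needs a stand-in */
    }

    if (replaced)
        tag_len = (size_t)snprintf(tag, sizeof tag, "~%u", serial);

    if (count + tag_len >= to_size)
        return -1;

    for (i = 0; i <= count; i++) {
        if (i == dot) {     /* before the extension, or at the very end */
            memcpy(to + n, tag, tag_len);
            n += tag_len;
        }
        if (i < count)
            to[n++] = (from[i] > 0xFF) ? '?' : (char)from[i];
    }
    to[n] = '\0';
    return replaced;
}
```

A name that is already pure ISO-8859-1, such as VilleDeMontréal.txt, passes through unchanged and gets no tag, as in the examples above.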

Cheers!

Last edited by tygre; 09 November 2020 at 23:53. Reason: Layout, typos, some more details
Old 10 November 2020, 04:52   #8
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
PS. I saw in proc.c the function

static int copy_utf16le_to_latin1(byte * to,int to_size,const byte * from,int len)


which is used in a few places, inside if statements like:

if(server->unicode_enabled)
{
    copy_utf16le_to_latin1(finfo->complete_path, finfo->complete_path_size, name, name_len);
}


Could this function help?

I had tried setting dos charset = UTF-8 on the Samba side, but maybe I made a mistake and should try a different combination of parameters?
Old 30 May 2021, 04:31   #9
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi all!

I'm happy to write that I have a proof-of-concept version of SMBFS that can handle files with non-Latin1 characters in their names (like the MP3 of this video: [YouTube video]).

It's rather "simplistic" right now: it replaces non-Latin1 names with a (unique) numerical name, but it could become smarter... Maybe using Jens' codesets.library?

I wonder if there is an interest (besides mine!) in improving this PoC?
Olaf, could I share with you my PoC, maybe via a pull request?

Cheers!
Old 08 June 2021, 19:38   #10
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
PS. Just sent a pull request to Olaf
Old 13 June 2021, 10:29   #11
Olaf Barthel
Registered User
 
Join Date: Aug 2010
Location: Germany
Posts: 532
Quote:
Originally Posted by tygre
PS. Just sent a pull request to Olaf
Thank you very much! I'm looking at integrating the changes right now. There's still some left-over stuff in my working copy of post-2.22 changes that needs some scrutiny before I can proceed.

Hang on...

And the changes are committed. Version 2.23 is now tagged and ready for tinkering.

Last edited by Olaf Barthel; 13 June 2021 at 11:37.
Old 13 June 2021, 11:43   #12
Olaf Barthel
Registered User
 
Join Date: Aug 2010
Location: Germany
Posts: 532
Quote:
Originally Posted by tygre
It's rather "simplistic" right now: it replaces non-Latin1 names with a (unique) numerical name but it could become smarter... Maybe using Jens' codesets.library?
A simple (not necessarily simplistic) solution is better than a clever solution which takes a long time to arrive.

The best idea I had on how to achieve something similar would have involved encoding the Unicode characters in the drawer/file name. This would have required only small changes to smbfs, but it would have bumped against the name length limitations. With only 107 characters to work with and any encoding scheme taking up more than two characters to represent a 16 bit value, some names would never have fit. How do you show that directory entries have been omitted because of that? No idea. The same problem already exists for file/drawer names longer than 107 characters.

Your solution does not have this problem because it keeps the original name and its "alias" in memory. The extra memory spent will remain spent until smbfs shuts down, though. There's some room for improvement here, I'd say. Also, your solution could be extended to file/drawer names longer than 107 characters which smbfs cannot currently represent. This looks like the way forward to me.
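For readers who want a picture of what "keeping the original name and its alias in memory" can look like, here is a minimal sketch in C. It is not the code from the pull request: the struct and function names are invented for illustration, the store is a plain linked list, and it ignores which directory a name belongs to, which a real implementation would have to track.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical alias store: each original UTF-16 name gets a number
 * which can be shown on the Amiga side instead of the name itself. */
struct alias_entry {
    struct alias_entry *next;
    unsigned long id;   /* the numerical alias */
    size_t count;       /* number of UTF-16 code units in name[] */
    uint16_t name[];    /* original name (C99 flexible array member) */
};

struct alias_store {
    struct alias_entry *head;
    unsigned long next_id;
};

/* Returns the alias for 'name', creating one if needed; 0 means out of memory. */
static unsigned long alias_add(struct alias_store *s,
                               const uint16_t *name, size_t count)
{
    struct alias_entry *e;

    assert(s != NULL && name != NULL);

    for (e = s->head; e != NULL; e = e->next)
        if (e->count == count &&
            memcmp(e->name, name, count * sizeof(*name)) == 0)
            return e->id; /* already known: reuse the same alias */

    e = malloc(sizeof(*e) + count * sizeof(*name));
    if (e == NULL)
        return 0;
    e->id = ++s->next_id;
    e->count = count;
    memcpy(e->name, name, count * sizeof(*name));
    e->next = s->head;
    s->head = e;
    return e->id;
}

/* Returns the original name for an alias, or NULL if the alias is unknown. */
static const uint16_t *alias_lookup(const struct alias_store *s,
                                    unsigned long id, size_t *count)
{
    const struct alias_entry *e;

    assert(s != NULL && count != NULL);

    for (e = s->head; e != NULL; e = e->next)
        if (e->id == id) {
            *count = e->count;
            return e->name;
        }
    return NULL;
}

/* Releases every entry; after this call all aliases are forgotten. */
static void alias_free_all(struct alias_store *s)
{
    assert(s != NULL);
    while (s->head != NULL) {
        struct alias_entry *e = s->head;
        s->head = e->next;
        free(e);
    }
}
```

The memory concern raised above shows up directly: entries can only be released safely once no alias handed out earlier can still be used to open a file, which in this sketch means calling alias_free_all() at shutdown.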

Last edited by Olaf Barthel; 13 June 2021 at 11:58.
Old 16 June 2021, 19:07   #13
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Quote:
Originally Posted by Olaf Barthel
And the changes are committed. Version 2.23 is now tagged and ready for tinkering.
Nice, thank you!

Cheers!
Old 16 June 2021, 19:19   #14
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi Olaf!

Quote:
Originally Posted by Olaf Barthel
Your solution does not have this problem by keeping the original name and its "alias" in memory. The extra memory spent will remain spent until smbfs shuts down, though. There's some room for improvement here, I'd say. Also, your solution could be extended to file/drawer names longer than 107 characters which smbfs cannot currently represent. This looks like the way forward to me.
For the memory taken by the store, do you think that it could be freed between "uses" of SMBFS, like between a call to smb_proc_readdir_short() and one to smb_proc_open()?

Indeed, the maximum length of file names (and directory names) is a hard constraint (in every sense of the term). I actually limit the names to 31 characters because I ran into problems with the Ram Disk and some other programs... But maybe these problems came from my install?

Agreed on extending this solution for directory names!

Another thing I'd like to add is a real "transliteration" from Unicode to Latin1 but this seems complicated!

Cheers!
Old 17 July 2021, 05:28   #15
n9yty
Registered User
 
Join Date: Nov 2017
Location: Rockford IL / USA
Posts: 35
I haven't seen any binaries for releases of smbfs for quite a while, and I don't have a build environment set up. Is there somewhere to download them from?
Old 18 July 2021, 03:53   #16
tygre
Returning fan!
 
Join Date: Jan 2011
Location: Montréal, QC, Canada
Posts: 1,434
Hi n9yty!

No problem, I can share it with you... Where would be most convenient? The Zone maybe?

Let me know!
 

