I agree with most of the above, but:
Do not forget that BPLxDAT can be preloaded with a fixed pattern, which may free up some cycles at the cost of more complex computation (memory may also be saved, though at the cost of worse compression for the picture, since the overall entropy unavoidably increases). For example, in Hires with 3 bitplanes, BPL4DAT can be preloaded with a pattern such as $00FF, so the first 8 pixels use color registers 0 to 7 and the next 8 pixels use color registers 8 to 15. On top of this, the CLUT can be updated dynamically.
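To make the $00FF example concrete, here is a minimal sketch of how the color register index would come out per pixel, assuming the three fetched planes supply bits 0 to 2 and the constant word in BPL4DAT supplies bit 3, with the most significant bit of each word displayed first. The function names are hypothetical and exist only for illustration; this models the arithmetic, not the hardware setup.

```python
# Sketch only: models the color index arithmetic, not real register access.
# pixel_bit and color_indices are hypothetical helper names.

def pixel_bit(word, x):
    """Bit for pixel x (0 = leftmost) of a 16-bit plane word; MSB is shown first."""
    return (word >> (15 - x)) & 1

def color_indices(plane_words, fixed_bpl4=0x00FF):
    """Return the 16 color register indices for one 16-pixel group.

    plane_words: three 16-bit words for bitplanes 1..3 (picture data).
    fixed_bpl4:  the constant word assumed to sit in BPL4DAT.
    """
    p1, p2, p3 = plane_words
    indices = []
    for x in range(16):
        idx = (pixel_bit(p1, x)
               | pixel_bit(p2, x) << 1
               | pixel_bit(p3, x) << 2
               | pixel_bit(fixed_bpl4, x) << 3)  # fixed pattern picks the bank
        indices.append(idx)
    return indices

# The same 3-bit value (5) across the whole group lands in two different banks.
group = color_indices((0xFFFF, 0x0000, 0xFFFF))
print(group[:8])   # left half uses registers 0..7
print(group[8:])   # right half uses registers 8..15
```

With $00FF the left 8 pixels of every 16-pixel group stay in registers 0 to 7 and the right 8 pixels land in registers 8 to 15, so the converter has to pick two 8-color sub-palettes per group, which is where the extra computation comes from.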

AFAIK the ST can't do anything beyond moving data from source to target, because everything costs heavy CPU time (it lacks dedicated HW; even overscan is only possible thanks to a bug in the graphics HW and needs cycle-exact CPU timing).

Also, you mentioned an important limitation of the dynamic CLUT: it works well only with a limited set of pictures, preferably highly detailed ones with a limited overall tonal range (unless there is a serious breakthrough in conversion algorithms).

On AGA this limitation may be partially worked around by using CLUT switching (a single write may switch a whole group of registers). I am still thinking about adding external SRAM to OCS/ECS, addressed by the Denise output and accessible through one of the unused chipset addresses on the RGA bus. Having, for example, a 32768-entry, 24-bit-wide CLUT may be the easiest way to extend Amiga OCS/ECS graphics capabilities.
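A rough sketch of the "single write switches a group of registers" idea, under the assumption that the mechanism meant here is the BPLAM field of BPLCON4, which as far as I know is XORed with the 8-bit bitplane color index before the color table lookup. That reading is an assumption, and the helper names below are hypothetical.

```python
# Sketch only: models an XOR mask applied to the color index, as the AGA
# BPLAM field is understood to do. Not actual register programming.

def displayed_register(pixel_value, bplam):
    """Color register actually used for a pixel value under a given XOR mask."""
    return (pixel_value ^ bplam) & 0xFF

def palette_window(bplam, colors=16):
    """Registers reached by pixel values 0..colors-1 for one mask setting."""
    return [displayed_register(v, bplam) for v in range(colors)]

# A 16-color picture: mask $00 uses registers 0..15, mask $10 uses 16..31.
print(palette_window(0x00))
print(palette_window(0x10))
```

Under that assumption, changing one mask value mid-frame would move a 16-color picture to a different block of 16 registers without rewriting each color register individually.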

Last one: the ST can't do 640x512 in 16 colors due to HW limitations, and even its marginally faster CPU can't do anything about this (additionally, most of those special modes on the Amiga are insensitive to CPU speed; Dynamic Hires can work very nicely with a 68060).
