[uClinux-dev] ucLinux and XIP memory savings

Discussion:

Anna Fischer (novero/Bochum)

2013-07-04 12:32:22 UTC

Hi everyone,

I was wondering whether anyone has experience to share about using XIP (execute in place) on uClinux. As far as I can see, it is supported for the bootloader, the kernel, and also for applications, provided the toolchain supports it. My question is whether anyone has studied how much memory XIP really saves. Instead of copying whole executables to RAM, only the .data and .bss sections of programs are placed in RAM. Boot time is also usually reduced with XIP. However, I'm keen to hear about practical examples and real numbers on exactly what is saved. Can anyone provide more information on this? I could not find any resources online.

Thanks for any pointers.

Anna

Wolfgang Wegner

2013-07-04 12:52:32 UTC

Hi,

It is quite some time since I tried XIP on uClinux/ColdFire, but I
could not get it to run. I do not recall exactly which problem I had,
because I was fine without it in the end.

What is it you want to achieve? Is XIP the only possible option?

In my case, I was hoping to reduce startup and boot time in general,
and I ended up with a different compression scheme for the kernel
and rootfs (U-Boot images), because load time from flash and
decompression were what really made my system slow. What is more,
flash access time becomes an even more critical factor when using
XIP.

On the other hand, my system was quite specialized, with only one
application running: it was started at boot and only terminated in
case of power failure or watchdog reset. ;-) However, this might
be a common scenario for many embedded applications.

Just asking, because memory is no longer a cost factor in most
cases, so maybe focusing on XIP is not the way to go, especially
considering that flash access is slower than DRAM in most cases
anyway.

Best regards,
Wolfgang

Anna Fischer (novero/Bochum)

2013-07-04 13:20:12 UTC

I'm considering XIP because it could potentially allow me to run a small microcontroller with just internal SRAM (of which there is very little) instead of adding external SDRAM. I'm hoping that XIP could help me save RAM here, because I would not need to load any program code into RAM. I'm talking about a small microcontroller system, so any savings in RAM are very beneficial.

In my case the access time to flash would be OK compared to external RAM, especially when using a parallel NOR flash. There are also D/I-caches on the chip and a tightly-coupled memory (TCM) module where critical code can be placed for low-latency access. So really, for now I just care about saving as much RAM as possible. XIP seems to help, but I'm looking for some numbers to help quantify the benefit.
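
A back-of-the-envelope estimate of the kind asked for here can be scripted. The sketch below assumes a simple model: a non-XIP program needs RAM for .text, .data, .bss and stack for every running instance, while an XIP program needs RAM only for .data, .bss and stack, with .text staying in flash. The function name and the section sizes are invented for illustration, and rounding by the kernel's allocator is ignored.

```python
def ram_needed(text, data, bss, stack, instances=1, xip=False):
    """Estimated RAM in bytes for `instances` running copies of one program.

    Assumed model: without XIP every instance gets its own copy of .text in
    RAM; with XIP the text is executed from flash and costs no RAM.
    """
    per_instance = data + bss + stack
    if not xip:
        per_instance += text
    return per_instance * instances


# Hypothetical section sizes in bytes, made up for illustration only
shell = dict(text=300 * 1024, data=8 * 1024, bss=12 * 1024, stack=4 * 1024)

for n in (1, 3):
    copied = ram_needed(instances=n, **shell)
    in_place = ram_needed(instances=n, xip=True, **shell)
    print(f"{n} instance(s): {copied // 1024} KiB copied to RAM, "
          f"{in_place // 1024} KiB with XIP, "
          f"saving {(copied - in_place) // 1024} KiB")
```

Under this model the saving is simply the text size times the number of running instances, so real numbers depend mostly on how large the code is relative to data and how many copies run at once.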

Thanks,
Anna

Larry Baker

2013-07-04 15:02:25 UTC

Anna,

It is a national holiday in the US, so I am out of the office until Monday when I will be able to send you more details.

I tried to use a Lantronix EDS2100 for an RS-232 data-logging application with remote access. That box has an M68K ColdFire processor, 8 MB RAM, and 8 MB flash. I used XIP and every other technique I could find to free up RAM. The biggest headache was the Linux 2.6 power-of-2 buddy system memory allocator. I guess in the 2.4 kernel, there was a boxcar memory allocator. That would have been better for such a small memory system. I had to resort to fixing GCC to try to catch stack overflow problems in standard apps (NTP, for time -- no RTC). But I ran out of time to get the system to run reliably -- it kept locking up because of memory allocation failures due to the power-of-2 memory allocation scheme.

I have since discovered PlugPCs and similar systems based on ARM SoCs. Systems built around Marvell's Kirkwood processors have ~512 MB RAM and ~512 MB-1 GB flash -- plus an MMU (no FPU). I am currently prototyping a system using a Glomation GESBC-9G20. That box has an Atmel AT91 SAM processor, 64 MB RAM, 256 MB flash, an MMU, and no FPU. They have a very inexpensive board (US$35) on their web site (http://www.glomationinc.com) that you might consider. The one I chose has an RS-232 port, an enclosure, and a power supply, all for about US$100 in the quantities I'll need. The Atmel AT91 SAM processors are not bleeding-edge ARM SoCs, but I am able to use a modern U-Boot and Linux 3.2 kernel. I haven't found a drop-in root FS small enough yet. Soon.

Larry Baker
US Geological Survey
650-329-5608
baker at usgs.gov

_______________________________________________
uClinux-dev mailing list
uClinux-dev at uclinux.org
http://mailman.uclinux.org/mailman/listinfo/uclinux-dev
This message was resent by uclinux-dev at uclinux.org
http://mailman.uclinux.org/mailman/options/uclinux-dev

Anna Fischer (novero/Bochum)

2013-07-05 06:20:22 UTC

Hi Larry,

Thanks for your response.

Post by Larry Baker
The biggest headache was the Linux 2.6 power-of-2 buddy system memory
allocator. I guess in the 2.4 kernel, there was a boxcar memory
allocator. That would have been better for such a small memory system.

Is this really still true? I had read somewhere that it is possible to replace the standard kernel memory allocator under uClinux with one that is better suited to embedded systems, e.g. a block-based memory pool type allocator. I cannot find the reference now, though.

Also, I found in the kernel documentation that in the no-MMU configuration you can disable power-of-2 round-ups by setting sysctl `vm.nr_trim_pages' to 0. This would allow finer-grained memory allocation and help limit fragmentation.

I haven't used any of these myself, so any guidance on the suitability of those configuration options would be great.

Thanks,
Anna

Anna Fischer (novero/Bochum)

2013-07-05 07:01:31 UTC

Actually, I have just found that you are right: this was only the case for Linux 2.4, which is completely irrelevant now :-(

However, the power-of-2 memory allocation is a general problem for uClinux which is independent of XIP, isn't it? XIP does not make this any worse, does it? I'd say it should be the opposite, since with XIP we use less RAM in general and so memory fragmentation hopefully has a little less impact.

Anna

Larry Baker

2013-07-05 07:11:45 UTC

Anna,

Post by Anna Fischer (novero/Bochum)
However, the power-of-2 memory allocation is a general problem for uClinux which is independent of XIP, isn't it?

Yes. There was a non-power-of-2 user memory allocator in Linux 2.4 (a uClinux project?). That was never ported to 2.6 or later. When I looked at the code, I came to the conclusion that too many places in the kernel rely on the power-of-2 allocations. I could be wrong, and something like a Fibonacci-based buddy memory allocator might work, as long as the smallest memory allocation unit is at least the page size. It is an interesting academic exercise, made a bit irrelevant by the cheap ARM SoCs I've started using. I don't have enough lifetimes to pursue all the things that interest me. :)
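
To illustrate why a Fibonacci-based scheme might waste less memory than power-of-2 rounding, here is a minimal sketch that compares the block size each scheme would hand out for a request of a given number of pages. It only models the rounding step, not a working allocator, and the function names and request sizes are made up for illustration.

```python
def next_power_of_two(pages):
    """Smallest power of 2 that is >= pages (pages must be >= 1)."""
    return 1 << (pages - 1).bit_length()


def next_fibonacci(pages):
    """Smallest Fibonacci number (1, 2, 3, 5, 8, ...) that is >= pages."""
    a, b = 1, 2
    while a < pages:
        a, b = b, a + b
    return a


# Illustrative request sizes in pages
for request in (5, 9, 17, 33, 100):
    p2 = next_power_of_two(request)
    fib = next_fibonacci(request)
    print(f"{request:3d} pages requested: power-of-2 block {p2:3d} "
          f"(waste {p2 - request:2d}), Fibonacci block {fib:3d} "
          f"(waste {fib - request:2d})")
```

Fibonacci block sizes are more closely spaced than powers of 2, so the rounding waste is often smaller, but not always: a 100-page request rounds to 128 pages with powers of 2 and to 144 with Fibonacci sizes.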

Post by Anna Fischer (novero/Bochum)
XIP does not make this any worse, does it? I'd say it should be more the opposite as by using XIP we use less RAM in general and therefore memory fragmentation hopefully has a little less impact.

This was my result. At work I have my notes for the development effort on the Lantronix EDS2100. I employed every technique I could find to free RAM for use by the programs I needed for my application.

Larry Baker
US Geological Survey
650-329-5608
baker at usgs.gov

Anna Fischer (novero/Bochum)

2013-07-05 13:03:05 UTC

Have you experienced other limitations when using XIP? I assume that, for example, it can be harder to debug XIP code?

Thanks,
Anna

Larry Baker

2013-07-05 19:41:45 UTC

Anna,

Post by Anna Fischer (novero/Bochum)
Have you experienced other limitations when using XIP? I assume that, for example, it can be harder to debug XIP code?

I did not write any code myself.

I had two chronic problems: running out of memory in large enough chunks
to connect an SSH session (as shown by cat /proc/buddyinfo), and program
crashes due, I believe, to stack overflows. I fixed the compiler to catch
stack overflows, and I patched BusyBox to enable printing fatal termination
messages to stderr, so that "echo 1 >/proc/sys/kernel/print-fatal-signals"
would produce a register dump when an illegal instruction exception occurred
(a tell-tale sign of stack corruption).
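
For anyone wanting to track the "large enough chunks" problem over time, the output of /proc/buddyinfo can be captured and analysed off-target. The sketch below is a minimal example, assuming the usual layout where each line lists, per zone, the number of free blocks of order 0, 1, 2 and so on; the sample line and the 4 KiB page size are made up for illustration.

```python
def largest_free_block(buddyinfo_text, page_size=4096):
    """Return (order, size_in_bytes) of the largest free block, or None.

    Each line is expected to look like
    "Node 0, zone   DMA   c0 c1 c2 ...", where cN is the number of free
    blocks of 2**N contiguous pages.
    """
    best = None
    for line in buddyinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "Node":
            continue
        counts = [int(f) for f in fields[4:]]
        for order, count in enumerate(counts):
            if count > 0 and (best is None or order > best):
                best = order
    if best is None:
        return None
    return best, page_size << best


# Made-up sample: plenty of small blocks, nothing larger than order 3
SAMPLE = "Node 0, zone      DMA     23     11      4      1      0      0      0      0      0      0      0\n"

order, size = largest_free_block(SAMPLE)
print(f"largest free block: order {order}, {size // 1024} KiB")
```

On a fragmented no-MMU system the total free memory can look healthy while the largest free block shrinks, which is the situation that makes a new process (such as an SSH session) fail to start.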

The memory shortage was the biggest headache. uClinux appears to have
custom versions of common applications that have been cut down to fit in a
memory-constrained system. NTP is clearly not one of them (the code is pretty
sloppy with memory use). Not having an MMU makes it that much harder to
figure out that stack overflow is the problem. Maybe my patched GCC will
make that easier now for people that are using M68K processors, like ColdFire.

I don't think XIP makes debugging uClinux code any harder, unless you plan to
use gdb and you want to patch the running code (i.e., insert breakpoints).
I should think you could twiddle the XIP bit for the one executable you are
working on if you need to do that. I think there is a utility to manipulate uClinux
flat binary header bits, but I don't remember what it is called.

I will send you off-list copies of my documentation for my uClinux work and my GCC fixes.

Larry Baker
US Geological Survey
650-329-5608
baker at usgs.gov

Ted Ma

2013-07-05 20:22:08 UTC

Hi Larry,
flthdr is the utility to manipulate the header:
====
The flat format also defines the stack size for an application as a field
in the flat header. To increase the stack allocated to an application, a
simple change of this field is all that is required. This can be done with
the flthdr command, like this:

    flthdr -s flat-executable

The flat format also allows two compression options. The entire
executable can be compressed, providing maximum ROM savings. It also
offers the often useful side effect that the application is loaded
entirely into a contiguous RAM block. You also may choose
data-segment-only compression. This is important if you want to save ROM
space but still want the option to utilize XIP. The following:

    flthdr -z flat-executable

creates a fully compressed executable, and

    flthdr -d flat-executable

compresses only the data segment.
====
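
The header fields that flthdr manipulates can also be inspected with a few lines of script. The sketch below is an illustration under stated assumptions, not a replacement for flthdr: it assumes the 64-byte flat (bFLT) header layout as I recall it from the kernel's flat.h (a 4-byte magic followed by big-endian 32-bit fields) and the flag values shown in the comments, so check both against your toolchain's headers. The header bytes parsed here are synthetic.

```python
import struct

# Assumed header layout: magic, 10 named 32-bit fields, 5 filler words (64 bytes)
FLAT_HDR = struct.Struct(">4s10I5I")
FIELDS = ("rev", "entry", "data_start", "data_end", "bss_end",
          "stack_size", "reloc_start", "reloc_count", "flags", "build_date")

# Assumed flag bits
FLAT_FLAG_RAM = 0x0001     # load the whole program into RAM
FLAT_FLAG_GOTPIC = 0x0002  # position-independent code with a GOT
FLAT_FLAG_GZIP = 0x0004    # everything after the header is compressed
FLAT_FLAG_GZDATA = 0x0008  # only the data segment is compressed


def parse_flat_header(blob):
    """Decode the named header fields from the first 64 bytes of a flat file."""
    values = FLAT_HDR.unpack_from(blob)
    if values[0] != b"bFLT":
        raise ValueError("not a bFLT file")
    return dict(zip(FIELDS, values[1:11]))


def summarize(hdr):
    """RAM-relevant sizes plus whether the flags leave XIP possible."""
    return {
        "data_bytes": hdr["data_end"] - hdr["data_start"],
        "bss_bytes": hdr["bss_end"] - hdr["data_end"],
        "stack_bytes": hdr["stack_size"],
        # Fully compressed or load-to-RAM binaries cannot run in place
        "xip_candidate": (hdr["flags"] & (FLAT_FLAG_RAM | FLAT_FLAG_GZIP)) == 0,
    }


# Synthetic header for illustration: data-only compression, 4 KiB stack
blob = FLAT_HDR.pack(b"bFLT", 4, 0x40, 0x1040, 0x1840, 0x2040, 4096,
                     0x1840, 0, FLAT_FLAG_GOTPIC | FLAT_FLAG_GZDATA, 0,
                     0, 0, 0, 0, 0)
print(summarize(parse_flat_header(blob)))
```

The point of the last field is the one made above: a binary with only its data segment compressed still qualifies for XIP, while a fully compressed one has to be expanded into RAM.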
...MaTed

--
____________________________________
Ted Ma
Arcturus Networks Inc.
300-701 Evans Ave.
416 621-0125 x231
Toronto, Ontario
M9C 1A3

Anna Fischer (novero/Bochum)

2013-07-08 10:09:44 UTC

Is there a limit on the size of any of the sections (.text, .bss, .data) when using PIC on ARM architectures? As far as I have read, on m68k architectures there used to be a size limit of 32K for .data+.bss and of 64K for .text. This applies only to the PIC executable format.
Are there any similar restrictions for XIP on ARM?

Thanks.

Larry Baker

2013-07-05 07:04:28 UTC

Anna,

Post by Anna Fischer (novero/Bochum)
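I had read somewhere that it is possible to replace the standard kernel memory allocator under uClinux with one that is better suited to embedded systems, e.g. a block-based memory pool type allocator.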

I believe these are references to the choice of SLAB/SLOB/SLUB for the kernel memory allocator. The trouble I ran into was with the user memory allocator. It is my impression that the kernel memory allocator calls the user memory allocator for big chunks, then uses its own SLAB/SLOB/SLUB allocation strategy to hand memory out to kernel clients.

Post by Anna Fischer (novero/Bochum)
Also, I found in the kernel documentation that in the no-MMU configuration you can disable power-of-2 round-ups by setting sysctl `vm.nr_trim_pages' to 0. This would allow finer-grained memory allocation and help limit fragmentation.

I've not heard of this. In https://www.kernel.org/doc/Documentation/sysctl/vm.txt it says

==============================================================
nr_trim_pages
This is available only on NOMMU kernels.
This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.
A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.
The default value is 1.
See Documentation/nommu-mmap.txt for more information.
==============================================================

In https://www.kernel.org/doc/Documentation/nommu-mmap.txt it says

=================================
ADJUSTING PAGE TRIMMING BEHAVIOUR
=================================
NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
when performing an allocation. This can have adverse effects on memory
fragmentation, and as such, is left configurable. The default behaviour is to
aggressively trim allocations and discard any excess pages back in to the page
allocator. In order to retain finer-grained control over fragmentation, this
behaviour can either be disabled completely, or bumped up to a higher page
watermark where trimming begins.
Page trimming behaviour is configurable via the sysctl `vm.nr_trim_pages'.

I think

1) this is not a different user memory allocator -- memory is still always allocated in powers of 2;
2) the "excessive" trimming would result in small fragments for the life of the larger allocation;
3) if none of those smaller fragments happens to get allocated in the meantime, then when the larger
allocation is released, they will all be coalesced back into the original block;
4) if one of those smaller fragments does get allocated in the meantime, and is held for a
long time, the original block cannot be reassembled; that is the negative effect that can be
overcome using vm.nr_trim_pages.
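
The behaviour described in the quoted documentation can be modelled in a few lines. This is a sketch of my reading of that text, not the kernel code: a request is rounded up to a power-of-2 number of pages, and the excess pages are given back only if trimming is enabled (non-zero) and the excess reaches the watermark. The function name and request sizes are made up for illustration.

```python
def pages_retained(request_pages, nr_trim_pages=1):
    """Pages an allocation keeps under the documented trimming rule.

    Assumed model: 0 disables trimming, so the full power-of-2 block is
    kept; otherwise the excess is returned once it is >= nr_trim_pages.
    """
    rounded = 1 << (request_pages - 1).bit_length()
    excess = rounded - request_pages
    if nr_trim_pages != 0 and excess >= nr_trim_pages:
        return request_pages
    return rounded


for trim in (0, 1, 4):
    kept = [pages_retained(n, trim) for n in (9, 12, 15, 16)]
    print(f"vm.nr_trim_pages={trim}: requests of 9, 12, 15, 16 pages keep {kept}")
```

In this model a setting of 0 keeps the whole rounded-up block (more RAM held, but the block comes back in one piece when freed), while 1 returns every excess page at once, which is where the small long-lived fragments in points 2) to 4) come from. A higher watermark sits between the two.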

Still, it might be worth experimenting with on a memory-constrained system. I ran out of time to
continue butting my head against the limitations of a no-MMU system, so I had to put that effort aside.

In addition to the problems with large enough chunks of free user memory disappearing over time,
I also had stack overflow problems with applications like NTP. An MMU Linux automatically extends
the user stack when it overflows (up to a limit). Even though GCC says it supports stack limit checking
for M68K processors, it causes an Internal Compiler Error for M68000 processors, such as ColdFire.
(See GCC bug 28896 at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28896.) I added support to
GCC for M68000 processors, and fixed the other bugs I found for stack limit checking. (Patches and
build instructions are also at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28896.)

Larry Baker
US Geological Survey
650-329-5608
baker at usgs.gov

Gavin Lambert

2013-07-04 22:33:27 UTC

Many years ago, I had a look at XIP (but couldn't actually use it in
practice because the bootloader I was using required the romfs to be
compressed).

During my investigation, I concluded that its primary benefit would be for
running multiple copies of busybox (which happens quite often when running
shell scripts etc) and in minimising memory fragmentation when running lots
of little programs. One downside is that flash is typically slower than
RAM, so actual execution would be slowed down a bit. (As a sidenote,
another [even more tiny and barebones] device I've used intentionally copies
and runs from RAM even though XIP is available because the flash is too slow
for the firmware's requirements.)

At the end of the day, though, my particular use case didn't require lots of
little programs, and it was easy to add more RAM, so I didn't worry about
XIP too much. But YMMV.