The correct solution for GCC is specifying 1-byte alignment for this particular ...

stephencanon · on Nov 7, 2016

This is documented as not working[1] in all but the most recent GCC versions. E.g. gcc-5.4 documents:

> "The aligned attribute can only increase the alignment; but you can decrease it by specifying packed as well. See below."

but gcc-6.2 documentation adds:

> "When used as part of a typedef, the aligned attribute can both increase and decrease alignment, and specifying the packed attribute generates a warning."

FWIW, Clang has supported reducing alignment in this fashion for a few years now.

[1] it's possible (likely, even) that it has been working for a while but the documentation was only recently brought up to date; I haven't investigated too carefully.

qb45 · on Nov 7, 2016

Must have been supported for a while because this stuff is even used in Linux:

  #define __packed2__     __attribute__((packed, aligned(2)))
  
  /*
   * SystemV FS comes in two variants:
   * sysv2: System V Release 2 (e.g. Microport), structure elements aligned(2).
   * sysv4: System V Release 4 (e.g. Consensys), structure elements aligned(4).
   */
  
  struct sysv2_super_block {
          __fs16  s_isize;                /* index of first data zone */
          __fs32  s_fsize __packed2__;    /* total number of zones of this fs */

http://lxr.free-electrons.com/source/include/linux/sysv_fs.h...

codys · on Nov 7, 2016

Note the use of `packed`. `packed` causes the alignment to be set as small as possible (1 byte). The aligned(2) that follows then increases the alignment from 1 byte to 2 bytes.

qb45 · on Nov 7, 2016

Yeah, you're right. Add packed then if you want but I swear the code I posted works on 5.4 as it is.

speps · on Nov 7, 2016

In C++11 there is a standard for this : http://en.cppreference.com/w/cpp/language/alignas

qb45 · on Nov 7, 2016

I think it doesn't work in this case because

If the strictest (largest) alignas on a declaration is weaker than the alignment it would have without any alignas specifiers (that is, weaker than its natural alignment or weaker than alignas on another declaration of the same object or type), the program is ill-formed.

Also, using alignas on pointer variable will probably specify alignment of the pointer itself, not the pointed object. I suppose you can create a wrapper class around int with alignas(32) and specify the pointer as pointing to that, but this is going to be nasty given that you can't derive from int and have to write all those operators by hand.

d33 · on Nov 7, 2016

Anybody knows how Rust handles this problem?

alkonaut · on Nov 7, 2016

Not sure if this is what you are asking: Last time I tried alignment in Rust I worked around the lack of explicit alignment support by adding a zero length array of the correct size to the end of the struct. Not sure if alignment support from proper attributes has landed yet.

   [repr(C)]
   struct Something
   {
      pub foo: f32,
      pub _alignment: [EightBytes, 0]
   }

where "EightBytes" is a data type of size 8, to align the whole struct on 8 bytes.

steveklabnik · on Nov 7, 2016

Structs already insert padding to give them alignment:

    struct One {
      foo: u8,
    }
    
    struct Two {
      bar: u16,
    }
    
    struct Three {
      foo: u8,
      bar: u16,
    }
    
    struct Four {
      foo: u16,
      bar: u16,
    }
    
    fn main() {
        assert_eq!(1, std::mem::size_of::<One>());
        assert_eq!(2, std::mem::size_of::<Two>());
        assert_eq!(4, std::mem::size_of::<Three>());
        assert_eq!(4, std::mem::size_of::<Four>());
    }

alkonaut · on Nov 7, 2016

What I needed to do was to place e.g. a N byte struct exactly on an M byte alignment (e.g. 11 byte struct on 32 byte alignment etc).

steveklabnik · on Nov 7, 2016

Ah! Yeah, that's different, and as far as I know your way is the current right way to do it, but I'm not an expert on the subject.

Thiez · on Nov 7, 2016

Well given that Rust leaves struct layout undefined unless #[repr(C)] is specified, std::mem::size_of::<Three> is actually not guaranteed to be 4.

steveklabnik · on Nov 7, 2016

While this is true, it's a weird angle that's not itself super well defined. That is, the definition of repr(packed) implies that the alignment will always be there without it...

wyldfire · on Nov 7, 2016

I doubt you can count on all items being allocated at addresses that are multiples if their size.

It's not optimal but you can always use libc::posix_memalign()

alkonaut · on Nov 7, 2016

The final element of the struct is a zero length array of elements of size N bytes. So that element isn't padding, it has size 0! It's pure hinting. I'm not sure why or how this works under the hood I'm afraid. I used it successfully to call a library with pretty strict alignment requirements (Intel Embree).

Thiez · on Nov 7, 2016

The zero length array still must be aligned correctly. You can create a pointer to it (e.g. by taking &something._alignment[..] to create a slice) and pointers must have correct alignment, so it follows that a zero length array has the same alignment requirements as a longer array. So padding must be inserted in your struct so that the address of the zero-length array is correctly aligned.

alkonaut · on Nov 8, 2016

Hmm, I don't understand (sorry). The pointer to the internal array must be aligned. But how does the element size play into the padding. If we take these two examples?

   [repr(C)]
   struct StructA
   {
      pub foo: f32,
      _alignment: [SixteenBytes, 0]
   }

   [repr(C)]
   struct StructB
   {
      pub foo: f32,
      _alignment: [ThirtytwoBytes, 0]
   }

Are both just padded with 16-4 and 32-4 bytes respectively, so they are equivalent to making a padding like this in the first case?

   [repr(C)]
   struct StructAPadded
   {
      pub foo: f32,
      _padding : TwelveBytes;
   }

hsivonen · on Nov 7, 2016

Types in Rust have alignment as in C, and bad stuff may happen if you use unsafe code to violate alignment.

It seems that at present, if you want to generate an unaligned SSE2 load/store, you have to rely on modern memcpy (spelled copy_nonoverlapping in Rust) optimization. See the first reply to https://internals.rust-lang.org/t/unaligned-simd-sse2-in-par...

saynsedit · on Nov 7, 2016

Nah, correct solution is simply to use memcpy(), works on all compilers, all platforms, all versions, with SSE and with any flags specified:

  #include <stdlib.h>
  #include <stdint.h>

  uint64_t sum (char *p, size_t nwords)
  {
      uint64_t res = 0;
      size_t i;
      for (i = 0; i < nwords; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, &p[i], sizeof(tmp));
        res += tmp;
      }
      return res;
  }

qb45 · on Nov 8, 2016

Nitpick: memcpy is string.h not stdlib.h, the type was uint32_t not uint64_t and you are making some unwarranted assumptions about sizeof(uint64_t), not to mention that the existence of this type is merely implementation defined ;)

Deal breaker: your memcpy invocation requires a sufficiently smart compiler to convert into normal unaligned load on x86 and seems to prevent GCC autovectorization. In this case OP actually didn't want vectorization, but in general it happens that such workarounds confuse compilers and produce worse code.

saynsedit · on Nov 8, 2016

I'm not sure I understand your deal breaker. For the platform he was targeting it produces optimal code, for other platforms it's merely slower (but not specifically slower, since the compiler is likely not a great optimizer across the board).

Vectorization is in general not applicable here since it usually requires aligned memory... not all implementations do, but most. In any case, benchmarking is more appropriate than armchair optimizing.

qb45 · on Nov 8, 2016

You are writing convoluted code and hoping that your compiler will figure it out and convert it internally to the form I posted. Sometimes it does, sometimes it doesn't. In this case it generates reasonable code but doesn't vectorize it for some reason. WTF.

I prefer to just add alignment specification and move on, assuming I don't care about portability. If portability matters, reread my original post ;)

saynsedit · on Nov 9, 2016

It's not convoluted. It's actually clear and well-defined making it easier to reason about.

I'd call compiler specific alignment attributes more arcane, convoluted, and susceptible to future bugs.

Vectorization isn't a panacea. You need to benchmark to be sure, lacking that I expect GCC to be better at optimizing code than you. If you disagree, please manually write a vectorized one that handles non-aligned addition and post your results :)

tropo · on Nov 7, 2016

You also need the may_alias attribute to prevent other problems.

Data written as int may be read as char, but going the other way is usually a standards violation. (an exception being if you had used char to implement a memcpy-like function, but in that case you should expect compiler bugs to bite you)

Thiez · on Nov 7, 2016

Implementing `memcpy` can be a challenge of its own. Gcc has a tendency to replace memcpy implementations with a call to... memcpy. Which is quite smart, just not in that particular context...

[1]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56888

qb45 · on Nov 8, 2016

Interesting, but is this a real world problem or just a standard nobody follows? I'm under impression that lots of networking and storage code running in the wild casts back and forth between char arrays and other types without giving it a thought.

And yes, as for other casts, I'm well aware that they cause problems.

MaulingMonkey · on Nov 7, 2016

Specifying alignment might make the crash go away. Like the "disable SSE" solution of the post, it does not make the undefined behavior go away and is not actually a solution.

qb45 · on Nov 7, 2016

If certain compilers provide functionality like __aligned__ or __packed__ it isn't undefined anymore on these particular compilers. I was specific about this being a GCC/clang thing (GCC to be exact, but somebody else confirmed clang).

MaulingMonkey · on Nov 7, 2016

__aligned__/__packed__ allow you to control alignment and packing, useful for e.g. avoiding false sharing in multithreaded code, and potentially useful if you're doing some of the type aliasing allowed by the standard - namely, type punning to/from char.

Unless __aligned__ or __packed__ is documented as also relaxing the strict aliasing rules - if such documentation exists, I haven't found it - there's still undefined behavior from the strict aliasing violation by type punning size_t <-> uint32_t.

Bad alignment is not the only possible "bad" optimization compilers can apply that relies on this being undefined behavior. For example, they may mistakenly assume a size_t value in memory can be cached in a register, and not reloaded from memory when it calls code that uses only uint32_t* pointers, because in a standards compliant program these cannot be used to modify size_t values.

You need __may_alias__ as well.

acqq · on Nov 7, 2016

Have you actually tried and verified that? Reading the docs I've had an impression that it says that the "aligned" is for the alignment inside of the structures, and you here declare the simple type?

qb45 · on Nov 7, 2016

I wouldn't have posted this without checking. And as far as documentation goes, they say this attribute applies to "types" and the example they give is

            struct S { short f[3]; } __attribute__ ((aligned (8)));

where it actually is used outside of a struct so that this whole 6B thing is aligned to 8B instead of the default 2B.

acqq · on Nov 7, 2016

Thanks! Are you able to post how the generated asm of their innermost loop looks like with SSE2 turned on? I mean the equivalent of theirs

    .L13:
         movdqa   (%r8), %xmm2
         ...

qb45 · on Nov 7, 2016

movdqu, obviously. And the code is shorter now, I guess it drops this part responsible for processing the first 1~3 elements of the array to reach 16B alignment.

Sagiri · on Nov 7, 2016

The correct and portable solution is to not use uint32_t, but to use uint8_t, and construct the integers manually.

stephencanon · on Nov 7, 2016

Well, portable except for `uint8_t` not being guaranteed to exist. =)

Thiez · on Nov 7, 2016

Which is a feature! If you use `char` instead of `uint8_t` your program would still compile on a system that doesn't have `uint8_t`, but it is likely to do something entirely unexpected. At least when you use `uint8_t` you are warned at compile-time that your program is broken.

Sagiri · on Nov 7, 2016

It isn't?

At least that would be a failure at compile-time, rather than run-time.

stephencanon · on Nov 7, 2016

`uint8_t` cannot exist on any platform with `CHAR_BIT > 8`. Such platforms are non-existent in the mainstream CPU world, but surprisingly common in the DSP world.

And yes, it's a compile-time failure, which is great. My comment should not be read as criticism at all (though I would likely use `uint16_t`, as the OP says the code is intended to work with [presumably aligned] 16-bit words).