<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OwenShepherd.net</title>
	<atom:link href="http://www.owenshepherd.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.owenshepherd.net</link>
	<description>The low-level world</description>
	<lastBuildDate>Tue, 03 Jan 2012 20:52:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Thoughts on the Python object model</title>
		<link>http://www.owenshepherd.net/2012/01/03/thoughts-on-the-python-object-model/</link>
		<comments>http://www.owenshepherd.net/2012/01/03/thoughts-on-the-python-object-model/#comments</comments>
		<pubDate>Tue, 03 Jan 2012 20:52:52 +0000</pubDate>
		<dc:creator>Owen Shepherd</dc:creator>
				<category><![CDATA[Languages]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[MRO]]></category>
		<category><![CDATA[Objects]]></category>

		<guid isPermaLink="false">http://www.owenshepherd.net/?p=25</guid>
		<description><![CDATA[So, it&#8217;s been over a year since my last post? Well, I didn&#8217;t give an SLA when I created this blog, and I didn&#8217;t give a defined set of topics that would be covered either. Anyhow, this is going to &#8230; <a href="http://www.owenshepherd.net/2012/01/03/thoughts-on-the-python-object-model/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>So, it&#8217;s been over a year since my last post? Well, I didn&#8217;t give an SLA when I created this blog, and I didn&#8217;t give a defined set of topics that would be covered either. Anyhow, this is going to be a little higher level than my previous programming posts.</p>
<h1>The Python Object Model</h1>
<p>At the core of the Python object model are two types: <em>type</em> and <em>object</em>. They have a curious relationship: <em>type</em> is derived from <em>object</em>, and <em>object</em> is an instance of <em>type</em>. In fact, <em>type</em> is also an instance of <em>type</em>. To look through the object model, we are going to walk through the process of creating a new class step by step, by calling <em>type</em> directly (a Python class definition is essentially syntactic sugar for this):</p>
<ul>
<li>We&#8217;ll invoke <em>T = type(&#8220;T&#8221;, (object,), {})</em>, creating a new type named &#8220;T&#8221;, derived from <em>object</em>, and with no new methods</li>
<li><em>type</em> is an object, so we will invoke the <em>__call__</em> operator. Operators are always looked up on the object&#8217;s class, which in this case is <em>type</em> itself.</li>
<li>Therefore, we will be invoking <em>__getattribute__</em> on <em>type</em> for <em>__call__</em>. <em>type</em> defines a <em>__call__</em> method, so <em>__getattribute__</em> doesn&#8217;t have to do any complex lookups.</li>
<li><em>__call__</em> is a function, and all functions are descriptors, so it is time for an interlude&#8230;</li>
</ul>
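<p>The relationships described so far can be checked from the interpreter. This is a small sketch for Python 3; the names <em>Declared</em>, <em>T2</em> and <em>relations</em> are only illustrative and are not part of the walkthrough above.</p>

```python
# The class statement and the three-argument type() call build equivalent classes.
class Declared(object):
    pass

T = type("T", (object,), {})

# Each claim made in the text about type and object, as a checkable expression.
relations = {
    "T is an instance of type": isinstance(T, type),
    "Declared is an instance of type": isinstance(Declared, type),
    "T derives from object": issubclass(T, object),
    "type derives from object": issubclass(type, object),
    "object is an instance of type": isinstance(object, type),
    "type is an instance of type": isinstance(type, type),
}

# Calling type(...) goes through the __call__ found on type's own class,
# which is type itself, so the call can also be spelled out by hand.
T2 = type(type).__call__(type, "T2", (object,), {})

for claim, holds in relations.items():
    print(claim, holds)
print(T2.__name__, T2.__mro__ == (T2, object))
```

<p>Every entry in <em>relations</em> should come out true, and the hand-spelled call produces an ordinary class just as the plain <em>type(...)</em> call does.</p>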
<h1>Descriptors</h1>
<p>Descriptors are also objects. The most well known type of descriptor is that returned by <em>property</em> (<em>property</em> is actually a class!), but <em>classmethod</em> and <em>staticmethod</em> also return descriptors. Descriptors come in two types:</p>
<ul>
<li>Non-data descriptors, which implement only the <em>__get__</em> method. They are invoked only when found on the object&#8217;s class, and are overridden by entries in the object&#8217;s instance dictionary</li>
<li>Data descriptors, which also implement the <em>__set__</em> (or <em>__delete__</em>) method, and which take precedence over the instance dictionary</li>
</ul>
<p>If a descriptor is found during a lookup, then the method corresponding to the action will be invoked:</p>
<ul>
<li><em>__getattribute__</em> -&gt; <em>__get__</em></li>
<li><em>__setattr__ -&gt; __set__</em></li>
<li><em>__delattr__ -&gt; __delete__</em></li>
</ul>
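<p>The difference between the two kinds of descriptor is easiest to see when the instance dictionary holds an entry of the same name. A minimal sketch, in which the classes <em>NonData</em>, <em>Data</em> and <em>Holder</em> are made up for the illustration:</p>

```python
class NonData(object):
    """Non-data descriptor: defines only __get__."""
    def __get__(self, instance, owner):
        return "from NonData.__get__"


class Data(object):
    """Data descriptor: defines __get__ and __set__."""
    def __get__(self, instance, owner):
        return "from Data.__get__"

    def __set__(self, instance, value):
        raise AttributeError("read-only")


class Holder(object):
    nondata = NonData()
    data = Data()


h = Holder()
# Write straight into the instance dictionary, bypassing __setattr__.
h.__dict__["nondata"] = "from instance dict"
h.__dict__["data"] = "from instance dict"

print(h.nondata)  # the instance dictionary wins over a non-data descriptor
print(h.data)     # a data descriptor wins over the instance dictionary
```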
<p>So, a function is a descriptor. It&#8217;s irrelevant to this case, but it is actually a non-data descriptor. So&#8230;</p>
<ul>
<li><em>__get__</em> returns a bound method (Ever wondered where those were created? Now we know!), which fills in the <em>self</em> parameter for calls</li>
<li>It&#8217;s now time to call the bound method. Now, as functions (and bound methods) are actually objects, we could follow the above again. However, it is easy to see how this could result in infinite recursion; fortunately, this doesn&#8217;t happen: Python recognizes that we are calling a function, and calls it directly instead of repeating the above</li>
<li>Now we begin standard class construction: we will invoke <em>type.__new__(type, &#8220;T&#8221;, (object,), {})</em> followed by a do-nothing <em>type.__init__</em> method.</li>
</ul>
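<p>The bound-method step can be reproduced by hand. In this sketch (Python 3; <em>Greeter</em> is a hypothetical class used only for the demonstration) the plain function is fetched from the class dictionary and its <em>__get__</em> is called explicitly:</p>

```python
class Greeter(object):
    def greet(self):
        return "hello from " + type(self).__name__


g = Greeter()

# The plain function lives in the class dictionary...
func = Greeter.__dict__["greet"]
# ...and its __get__ builds the bound method that attribute access returns.
bound = func.__get__(g, Greeter)

print(type(func).__name__)   # function
print(type(bound).__name__)  # method
print(bound.__self__ is g)   # True: self is already filled in
print(bound() == g.greet())  # True
```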
<p>Constructing a new class is pretty simple: We create a new object whose instance dictionary is filled from the members dictionary passed as <em>type</em>&#8217;s third argument.</p>
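<p>As a rough sketch of that construction step, the two calls can be made directly. The <em>members</em> dictionary and the <em>describe</em> function are invented for the example; note that in CPython the class keeps its own copy of the dictionary, so later changes to <em>members</em> do not show up on the class.</p>

```python
def describe(self):
    return "I am a " + type(self).__name__


members = {"describe": describe, "answer": 42}

# Roughly what type("T", (object,), members) does on our behalf.
T = type.__new__(type, "T", (object,), members)
type.__init__(T, "T", (object,), members)

# The class took a copy, so this later addition is not visible on T.
members["late"] = True

print(T.__name__)            # T
print(T.__dict__["answer"])  # 42
print(T().describe())        # I am a T
print(hasattr(T, "late"))    # False
```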
<h1>Making an instance of an object</h1>
<p>Admittedly, we made an instance of an object above; but the case of creating a new type is rather different from the normal (because <em>type</em> is its own type). So, let&#8217;s create a new instance of the type <em>T</em> we created above:</p>
<ul>
<li>We&#8217;ll do <em>O = T()</em></li>
<li><em>T</em> is again an object, so we will invoke <em>__call__</em> again. Again, <em>T</em> is specifically a <em>type</em>, so we will be invoking <em>type.__call__(T)</em></li>
<li><em>type.__call__(T)</em> will look up <em>T.__new__</em>. <em>T</em> doesn&#8217;t define a <em>__new__</em> method, so we will look in the next type in the <a href="http://www.python.org/getit/releases/2.3/mro/">method resolution order</a>. Since <em>T</em> singly inherits (from <em>object</em>), we will be looking up <em>__new__</em> in <em>object</em>.</li>
<li><em>object.__new__(T)</em> will create a new instance of <em>T</em>, and set it up as a normal Python object (as you might expect from <em>object.__new__(T)</em>)</li>
<li><em>type.__call__(T)</em> will next invoke <em>__init__</em> on the new instance; this resolves to <em>object.__init__</em>, for the same reasons. Object&#8217;s <em>__init__</em> method does nothing, so nothing of interest happens here</li>
<li><em>type.__call__(T)</em> will return the newly created object</li>
</ul>
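<p>The steps above can be approximated in pure Python. The function <em>construct</em> below is a hypothetical stand-in for <em>type.__call__</em>, written only to make the order of operations visible; the real implementation lives in C and handles more cases.</p>

```python
class T(object):
    pass


def construct(cls, *args, **kwargs):
    """Hypothetical stand-in for what type.__call__(cls) does."""
    # __new__ is found through the method resolution order;
    # for T that means object.__new__.
    instance = cls.__new__(cls, *args, **kwargs)
    # __init__ only runs when __new__ returned an instance of cls,
    # and it is looked up on the type, like any operator.
    if isinstance(instance, cls):
        type(instance).__init__(instance, *args, **kwargs)
    return instance


o = construct(T)

print(type(o) is T)                 # True
print("__new__" in T.__dict__)      # False: T inherits it from object
print(T.__mro__ == (T, object))     # True
```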
<h1>Summarizing it all</h1>
<p>We can see that, from a couple of rules, we can derive the Python object system:</p>
<ul>
<li>We can separate lookups  into two types: <em>operator lookups</em> and <em>normal lookups</em></li>
<li>Operator lookups always look up on the type of the object</li>
<li>Normal lookups look up first on the object, and then on the type</li>
<li>Lookups always involve invoking the <em>__getattribute__</em> operator</li>
<li>If a data descriptor is found during a lookup, it is always invoked</li>
<li>If a non-data descriptor is found during a lookup, it is only invoked if it was found somewhere other than the instance dictionary of the object being interrogated</li>
</ul>
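<p>The split between operator lookups and normal lookups is easy to observe by shadowing a name in the instance dictionary. A short sketch for Python 3, with <em>Box</em> as an illustrative class:</p>

```python
class Box(object):
    def __len__(self):
        return 3

    def size(self):
        return 3


b = Box()
# Shadow both names in the instance dictionary.
b.__len__ = lambda: 99
b.size = lambda: 99

print(b.size())     # 99: a normal lookup checks the instance first
print(b.__len__())  # 99: spelled as an attribute, it is a normal lookup too
print(len(b))       # 3: the operator looks on type(b) and skips the instance
```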
<p>At its core, the Python object model is incredibly simple (excluding some complexity around descriptors, which is, in my opinion, justified because it makes <em>using it</em> more intuitive). It is, however, somewhat hard to inspect from outside; the core workings of it are not particularly well documented and are implemented in C, and some of the interactions are hard to identify without re-implementing it yourself (in fact, much of this was discerned by reimplementing it in Lua).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.owenshepherd.net/2012/01/03/thoughts-on-the-python-object-model/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The releng-0.5 branch&#8230; is open</title>
		<link>http://www.owenshepherd.net/2010/08/18/the-releng-0-5-branch-is-open/</link>
		<comments>http://www.owenshepherd.net/2010/08/18/the-releng-0-5-branch-is-open/#comments</comments>
		<pubDate>Wed, 18 Aug 2010 21:24:54 +0000</pubDate>
		<dc:creator>Owen Shepherd</dc:creator>
				<category><![CDATA[EForge]]></category>
		<category><![CDATA[Django]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Release]]></category>

		<guid isPermaLink="false">http://www.owenshepherd.net/?p=22</guid>
		<description><![CDATA[EForge is a project of mine &#8211; a project to build a better project management system. The aim is quite simple: Combine the best features of systems like Trac, RedMine and SourceForge, with the best features of systems like GitHub &#8230; <a href="http://www.owenshepherd.net/2010/08/18/the-releng-0-5-branch-is-open/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://eforge.e43.eu/p/eforge/">EForge</a> is a project of mine &#8211; a project to build a better project management system. The aim is quite simple: Combine the best features of systems like Trac, RedMine and SourceForge, with the best features of systems like GitHub and Gitorious.</p>
<p>It&#8217;s not a simple process by a long shot. There is a long road ahead of us.</p>
<p>But we just moved one small &#8211; but oh so very important &#8211; step closer. EForge is moving to its first release, 0.5.</p>
<h2>What is EForge today?</h2>
<p><span id="more-22"></span>OK, firstly: EForge is not a complete replacement for any of those systems yet. So far, it is most of the way to replacing Trac or Redmine. EForge offers</p>
<ul>
<li>A project wiki, with history tracking (as all good wikis must have)</li>
<li>A pretty competent issue tracker. It&#8217;s still missing some features &#8211; most notably sorting &#8211; but it&#8217;s getting there.</li>
<li>A source code browser for Git or Hg repositories. There are a few features missing, but the browser is already a good way of looking around your repository</li>
</ul>
<p>Now, I should point something out: <strong>0.5 is not really for production use. </strong>We expect that to happen more around the 0.6 or 0.7 time frame. 0.5 is just a preview release; a stabilised version for people to have a play around with.</p>
<p>Now, that said: EForge already works well enough for us to dogfood ourselves on it.</p>
<h2>Where are we going?</h2>
<p>There are a number of features we intend to add soon:</p>
<ul>
<li>Gitosis/Gitolite style user management. This is the &#8220;Common SSH account, various keys&#8221; system that you may be familiar with from services like Gitorious and GitHub</li>
<li>Sorting in the bug tracker</li>
<li>Subprojects</li>
<li>Integration with Django&#8217;s sites framework (i.e. projects can be assigned to sites &#8211; so you can have eforge1.com and eforge2.com and they&#8217;ll be fully interlinked)</li>
<li>User private clones of projects</li>
<li>Gitorious style merge requests</li>
<li><em>probably other things too <img src='http://www.owenshepherd.net/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </em></li>
</ul>
<h2>Helping out</h2>
<p>If you have any features you want &#8211; please log them on our <a href="http://eforge.e43.eu/p/eforge/tracker/">tracker</a>. If you want the source &#8211; Git instructions are <a href="http://eforge.e43.eu/p/eforge/wiki/">on the wiki</a>. If you&#8217;d like to contribute &#8211; either post a patch to the tracker, or set up a Git repository somewhere and point us to it.</p>
<p>It&#8217;s still early days, and there is a lot of work ahead of us.</p>
<p>But a massive milestone has been reached. The future is bright.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.owenshepherd.net/2010/08/18/the-releng-0-5-branch-is-open/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The magical Futex</title>
		<link>http://www.owenshepherd.net/2010/08/11/the-magical-futex/</link>
		<comments>http://www.owenshepherd.net/2010/08/11/the-magical-futex/#comments</comments>
		<pubDate>Wed, 11 Aug 2010 16:54:50 +0000</pubDate>
		<dc:creator>Owen Shepherd</dc:creator>
				<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Threading]]></category>

		<guid isPermaLink="false">http://www.owenshepherd.net/?p=18</guid>
		<description><![CDATA[It&#8217;s rare that computing creates something as elegant as the Futex: A simple and highly elegant system on top of which all the important synchronization primitives can be built, which has minimal overhead, and which is screamingly fast. It&#8217;s even &#8230; <a href="http://www.owenshepherd.net/2010/08/11/the-magical-futex/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s rare that computing creates something as elegant as the Futex: A simple and highly elegant system on top of which all the important synchronization primitives can be built, which has minimal overhead, and which is screamingly fast.</p>
<p>It&#8217;s even better when they work everywhere at the same efficiency &#8211; that is, whether placed in the process&#8217;s local memory, or a shared memory segment. And it&#8217;s such a shame that nobody has implemented them outside of Linux.</p>
<p><span id="more-18"></span></p>
<h1>So, what is a Futex?</h1>
<p>A futex is actually surprisingly simple. Creating one doesn&#8217;t require any system calls, and is in fact most trivial.</p>
<p style="text-align: center;"><em>A futex is just an aligned, normally 32-bit, integer </em></p>
<p style="text-align: left;">It&#8217;s truly quite simple. Now, this isn&#8217;t the whole story: you can&#8217;t implement an efficient mutex on just an integer. It&#8217;s just that the other involved structures (such as wait queues) only get involved when someone needs to wait for the futex. Until then, it is just a piece of memory that someone allocated.</p>
<p style="text-align: left;">(In case you&#8217;re wondering how they work in shared memory: Futexes are keyed upon their <em>physical</em> address)</p>
<h1>If it&#8217;s so simple, how do I use one?</h1>
<p>A futex is just a counter. In fact, it can be said that a futex supports just two operations: &#8220;down&#8221; and &#8220;up&#8221;; which decrement and increment it respectively. To begin this illustration, we are going to begin with the semaphore, which is nothing more than a generalisation of a mutex.</p>
<p>To begin, allocate your futex somewhere, and set its value to the maximum value of the semaphore. In other words, if you want to be able to acquire the semaphore 3 times, set it to 3. Obviously, for a mutex you would set it to 1.</p>
<p>Now, to acquire the semaphore: Perform an atomic decrement and fetch. Look at the value you just set the semaphore to: if it is positive or zero, you succeeded; if it is negative, you must ask the kernel to block you (we will get to that in a moment).</p>
<p>How about releasing the semaphore? This is also simple; do the opposite: an atomic fetch and increment. Again look at the value returned: if it was positive or zero, you&#8217;re done; otherwise you need to set the counter to one, then ask the kernel to wake people up.</p>
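<p>The user-space half of this can be sketched without any kernel involvement. The code below is only a model in Python: <em>CounterWord</em>, <em>try_acquire</em> and <em>release_needs_wake</em> are hypothetical names, and an ordinary lock stands in for the atomic instructions a real implementation would use. It stops at the decision of whether the kernel needs to be called, and does not perform the blocking, the reset of the counter, or the wake-up.</p>

```python
import threading


class CounterWord(object):
    """Hypothetical stand-in for the futex integer; a lock fakes the atomics."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def decrement_and_fetch(self):
        with self._lock:
            self._value -= 1
            return self._value

    def fetch_and_increment(self):
        with self._lock:
            old = self._value
            self._value += 1
            return old


def try_acquire(word):
    """Return (value_seen, acquired); a caller that failed would go on to block."""
    value = word.decrement_and_fetch()
    return value, value >= 0


def release_needs_wake(word):
    """Return True when the old value was negative, so somebody may be waiting."""
    return word.fetch_and_increment() < 0


sem = CounterWord(2)  # a semaphore that can be held twice
results = [try_acquire(sem) for _ in range(3)]
wake = release_needs_wake(sem)

print(results)  # [(1, True), (0, True), (-1, False)]
print(wake)     # True: the old value was -1
```

<p>The third caller sees a negative value, which is the signal to enter the kernel; the release sees a negative old value, which is the signal to wake somebody.</p>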
<h1>Getting the Kernel involved</h1>
<p>OK, so far we have been working entirely in user space; but we can&#8217;t have an efficient mutex without the kernel. Let&#8217;s get on to that.</p>
<p>Firstly, blocking; blocking is done by the FUTEX_WAIT operation. You pass the kernel the address of the futex and the value the decrement operation returned; it atomically checks that the futex still has the value you passed and, if so, blocks you.</p>
<p style="padding-left: 30px;">This check is required in order to avoid a race condition, in which another process ups the futex before you get into the kernel. If the value has changed, the call returns immediately and you must go around and retry the acquire.</p>
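<p>The compare-then-block behaviour can be modelled in a few lines. This is a toy simulation in Python built on <em>threading.Condition</em>, not the real system call: the class <em>SimulatedFutex</em> and its <em>wait</em> and <em>wake</em> methods are assumptions made for the sketch, loosely mirroring what FUTEX_WAIT and FUTEX_WAKE are described as doing.</p>

```python
import threading
import time


class SimulatedFutex(object):
    """Toy model of the kernel side; not the real futex system call."""

    def __init__(self, value):
        self.value = value
        self.waiters = 0
        self._tickets = 0  # wake-ups handed out but not yet consumed
        self._cond = threading.Condition()

    def wait(self, expected):
        """Block only if the value still equals expected; else return False."""
        with self._cond:
            if self.value != expected:
                return False  # lost the race: the caller must retry the acquire
            self.waiters += 1
            while self._tickets == 0:
                self._cond.wait()
            self._tickets -= 1
            self.waiters -= 1
            return True

    def wake(self, count):
        """Wake at most count waiters and return how many were woken."""
        with self._cond:
            woken = min(count, self.waiters - self._tickets)
            self._tickets += woken
            self._cond.notify_all()
            return woken


futex = SimulatedFutex(-1)

# Stale expectation: the value is -1, not -2, so wait() refuses to block.
stale = futex.wait(-2)

results = []
t = threading.Thread(target=lambda: results.append(futex.wait(-1)))
t.start()
while futex.waiters == 0:  # poll until the waiter is really parked
    time.sleep(0.001)
woken = futex.wake(1)
t.join()

print(stale)    # False
print(woken)    # 1
print(results)  # [True]
```

<p>The check and the decision to sleep happen under one lock in the simulation, which is what makes the comparison atomic with respect to <em>wake</em>.</p>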
<p>Once you have been unblocked, you need to attempt to lock the futex again.</p>
<p>Now, unblocking: this is done by the FUTEX_WAKE method. You pass the futex address, and the number of threads you wish to wake; 1 is obviously a good value most of the time (and should also be most fair).</p>
<p>There, that was simple enough, wasn&#8217;t it?</p>
<h1>Expanding: Condition Variables</h1>
<p>OK, that was pretty simple; but what about condition variables? They can be implemented in a rather simple manner too:</p>
<ul>
<li>Start the Futex at 0</li>
<li>Down it each time a thread waits on it</li>
<li>Up it each time you want to wake a single thread</li>
<li>To wake up everyone, exchange the current value for 0, then wake up <em>-previous_value</em> waiters (the negated previous value is the number of threads that were waiting)</li>
</ul>
<p>See? Those are pretty simple too!</p>
<h1>In closing: The issues, and how to make them better</h1>
<p>As you can see, Futexes are both delightfully simple, and beautifully efficient. However, there is an area I can see for improvement. It also corrects what I call a hole in POSIX: that mutexes cannot be converted to file descriptors (to wait on using mechanisms like select).</p>
<p>(In fact, it is interesting to note that there was in the past an attempt to add  this feature, FUTEX_FD, but it was racy).</p>
<p>I would add a single method, FUTEX_WAIT_FD. The behaviour of this is as follows:</p>
<p style="padding-left: 30px;">FUTEX_WAIT_FD will atomically compare the value of the futex with the previous value (as passed in a parameter), and build a file descriptor. The returned file descriptor is read only; attempts to write to it will return an error. When another thread or process calls FUTEX_WAKE on the futex, this will cause the file descriptor to become readable; reading it will return the number of waiters that FUTEX_WAKE was asked to wake up. Once this file descriptor has become readable, the thread must re-attempt to lock the futex in the same way as is done when woken from FUTEX_WAIT. After the file descriptor has become readable, it will not be activated again and must be closed.</p>
<p>I can also see other possibilities for improving the futex system (and other locking systems in general); obviously, they should be adapted to the operating system that is using them.</p>
<h2>Further Reading</h2>
<ul>
<li><a href="http://www.kernel.org/doc/man-pages/online/pages/man2/futex.2.html">man futex(2)</a></li>
<li><a href="http://www.kernel.org/doc/man-pages/online/pages/man7/futex.7.html">man futex(7)</a></li>
<li><a href="http://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf">Fuss, futexes and furwocks: Fast Userlevel Locking in Linux</a> (PDF, by Hubertus Franke, Rusty Russell, Matthew Kirkwood)</li>
<li><a href="http://dept-info.labri.fr/~denis/Enseignement/2008-IR/Articles/01-futex.pdf">Futexes are tricky</a> (PDF, by Ulrich Drepper, going into how to use them, and how Glibc uses them, in depth)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.owenshepherd.net/2010/08/11/the-magical-futex/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I believe in UDI</title>
		<link>http://www.owenshepherd.net/2010/07/23/i-believe-in-udi/</link>
		<comments>http://www.owenshepherd.net/2010/07/23/i-believe-in-udi/#comments</comments>
		<pubDate>Fri, 23 Jul 2010 21:54:12 +0000</pubDate>
		<dc:creator>Owen Shepherd</dc:creator>
				<category><![CDATA[Operating Systems]]></category>
		<category><![CDATA[UDI]]></category>

		<guid isPermaLink="false">http://www.owenshepherd.net/?p=15</guid>
		<description><![CDATA[UDI is the Uniform Driver Interface. It provides a standard, high performance interface for operating system device drivers, and is standardised at all the important levels &#8211; both API and ABI. This means you can take a driver &#8211; regardless &#8230; <a href="http://www.owenshepherd.net/2010/07/23/i-believe-in-udi/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>UDI is the Uniform Driver Interface. It provides a standard, high performance interface for operating system device drivers, and is standardised at all the important levels &#8211; both API and ABI. This means you can take a driver &#8211; regardless of if you have the source for it or not &#8211; and use it with any UDI supporting system. There is only one problem.</p>
<p>And that is that, unfortunately, UDI has been almost completely ignored.</p>
<p><span id="more-15"></span></p>
<p>Now, I&#8217;m going to admit something straight away: UDI is not perfect. UDI has its flaws; the one most commonly levelled at it is that it is complex, and indeed that is true. Complexity is never great, but it is acceptable if it brings features, and it does.</p>
<p>However, this is not the main reason people rejected it. They had other ones.</p>
<h2>&#8220;Why should I change my interface when it benefits my competitors?&#8221;</h2>
<p>This one often came from the Unix companies (and Novell, with regard to NetWare). And it is kind of true: If you support UDI, then your competitors will benefit from drivers written for your OS; but the effect is reciprocal: you will also benefit from drivers written for their OS.</p>
<p>It gets silly, however, when they claim &#8220;We have leading hardware vendor support! This will benefit my competitors more than me&#8221;, because there is only one company who can reasonably claim that &#8211; or who could reasonably claim it any time in the foreseeable past and present.</p>
<p>But then, how many of the Unix vendors are still standing today on the basis of their Unix offerings?</p>
<h2>&#8220;UDI will be slower than my native interface&#8221;</h2>
<p>OK, maybe. However, know this: UDI was designed to be fast and scalable. UDI pretty much forces you to write scalable drivers. Any benefit you can get over UDI will be small.</p>
<p>And, indeed, when SCO¹ implemented UDI, as a wrapper on top of their native interface, they found it was faster than their native drivers. This certainly sounds counter-intuitive: the thing to take away from this is that UDI has low overhead and produces fast drivers.</p>
<h2>&#8220;UDI will not work with my operating system design&#8221;</h2>
<p>Maybe, but unlikely. UDI was designed to operate on anything from a single tasking, single threaded, single CPU system, to a multi core, multi machine, multi threaded distributed system. You would have to be developing something <em>massively</em> radical to design a system in which UDI could not work.</p>
<h2>&#8220;UDI doesn&#8217;t have many drivers&#8221;</h2>
<p>This is a chicken and egg problem and you know it. However, for any operating system which wants to use UDI, there are UDI enthusiasts (of which I am one) who would be most willing to help anyone trying to implement it &#8211; and even contribute to the work.</p>
<p>Remember: Every operating system supporting UDI makes it more attractive to other operating systems and driver writers. Someone must just take the plunge and support it.</p>
<h2>&#8220;UDI means closed source drivers&#8221;</h2>
<p>This one comes primarily from the FSF, but then much of what the FSF says I disagree with. OK, I&#8217;ll admit: You would likely get closed source drivers &#8211; but then you already do. Have you ever seen the source of ATi or nVIDIA&#8217;s graphics drivers? And, of course, most devices have open datasheets anyway &#8211; so even if the vendor didn&#8217;t open source their drivers we wouldn&#8217;t be stuck.</p>
<p>In fact, I see very few cases where the current state of affairs has caused a driver&#8217;s source code to be released &#8211; most drivers are developed by unrelated developers, and the same benefits of delegating much of the ongoing maintenance work to the project&#8217;s volunteers would apply.</p>
<p>However, there is a very, very compelling reason to use UDI, which I feel outweighs this:</p>
<h2>UDI gives us better quality systems</h2>
<p>This applies from two points:</p>
<ul>
<li>Driver maintenance gets shared between bigger pools of people, increasing the overall quality of drivers</li>
<li>Operating systems developers can focus more on the bits of their OS which differentiate it, since they are no longer spending as much time on maintaining drivers</li>
</ul>
<h2>UDI gives us more choice</h2>
<p>At present, in descending order of hardware support, you have</p>
<ul>
<li>Windows</li>
<li>Linux</li>
<li>The BSDs, Solaris, etc</li>
</ul>
<p>with enormous gulfs between them. This is constricting: It means that people are often forced to choose one because of their hardware, and denies users choice.</p>
<p>UDI makes closing these gulfs much, much easier, because we reduce duplication of effort.</p>
<p>UDI benefits everyone.</p>
<h2>Where can I get more information?</h2>
<p>The UDI core specification can be found at <a href="http://projectudi.org/">projectudi.org</a>. However, this website hasn&#8217;t been updated since 2001. You may think that this makes UDI obsolete, but you would be wrong: Software and good engineering don&#8217;t rust. Daily we use technologies, such as the SCSI command set and x86 and ARM processor architectures, and operating systems, which are much older. Many of these have been modified substantially, true; but the core design concepts remain mostly unchanged since their inception.</p>
<p>Deven Corzine <a href="http://www.ties.org/deven/udi.html">said similar things about UDI</a> a while ago.</p>
<p>A group of us interested in reviving UDI have set up a website at <a href="http://udi.io">udi.io</a>, with associated mailing lists and other resources, on which to discuss taking UDI forwards. The effort is new, yes, but the discussion that preceded it has been going on for a year, and has involved some of the specification developers.</p>
<p>The UDI project is dead; long live the UDI project.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.owenshepherd.net/2010/07/23/i-believe-in-udi/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The sorry state of Unicode in C</title>
		<link>http://www.owenshepherd.net/2010/07/05/the-sorry-state-of-unicode-in-c/</link>
		<comments>http://www.owenshepherd.net/2010/07/05/the-sorry-state-of-unicode-in-c/#comments</comments>
		<pubDate>Mon, 05 Jul 2010 15:19:54 +0000</pubDate>
		<dc:creator>Owen Shepherd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[Unicode]]></category>

		<guid isPermaLink="false">http://www.owenshepherd.net/?p=7</guid>
		<description><![CDATA[Once upon a time&#8230; Things were nice and simple; all characters were the same size, and that size was 8-bits. Things were easy to handle, for the most part. Much of C&#8217;s string handling is rooted in this era, with &#8230; <a href="http://www.owenshepherd.net/2010/07/05/the-sorry-state-of-unicode-in-c/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>Once upon a time&#8230;</strong></p>
<p>Things were nice and simple; all characters were the same size, and that size was 8 bits. Things were easy to handle, for the most part. Much of C&#8217;s string handling is rooted in this era, with functions like <em>isalpha</em> inexorably tied to the English language.</p>
<p>Of course, since then the world has changed.</p>
<p><strong>Things got ugly: Multi-byte character sets</strong></p>
<p>Now, these aren&#8217;t all bad; many can be considered, for most things, like the one byte per character sets that preceded them. They maintain the invariant that the contents of a multi-byte character cannot be interpreted as a single byte character. This does of course make your coding take more space, but makes legacy programs work.</p>
<p>Some encodings, however, are not so kind. A good example of this is Shift-JIS. Shift-JIS reuses the single byte character space in the second bytes of multi-byte characters. This means that, if you do strchr(s, 'e'), the byte you find might not actually be an e.</p>
<p>Now things are getting really messy, because C doesn&#8217;t provide standard functions for dealing with these character sets. The other problem is that dealing with the thousands of character sets in existence is rather complex.</p>
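<p>The false match is easy to reproduce. The sketch below uses Python rather than C purely for brevity, and relies on Python&#8217;s built-in <em>shift_jis</em> codec; the byte pair 0x83 0x65 is a single two-byte Shift-JIS character whose second byte has the same value as ASCII e.</p>

```python
# 0x83 0x65 is one two-byte Shift-JIS character whose second byte equals ASCII "e".
data = bytes([0x83, 0x65])
text = data.decode("shift_jis")

byte_hit = data.find(b"e")  # what a byte-wise search such as strchr would report
char_hit = text.find("e")   # what a character-aware search reports

print(len(text))  # 1: a single character
print(byte_hit)   # 1: a false match inside the character
print(char_hit)   # -1: there is no "e" in the text
```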
<p><span id="more-7"></span><strong>The solution: Unicode</strong></p>
<p>Unicode, as you probably know, provides an incredible repertoire of characters, and is capable of representing pretty much every character in common (and uncommon) use.</p>
<p>Unfortunately, there is a great big problem.</p>
<p><strong>C&#8217;s Unicode support sucks</strong></p>
<p>C99 doesn&#8217;t actually define any official support for Unicode; it defines support for &#8220;Wide Characters&#8221;, which every implementation defines to be Unicode because this is the only sane option.</p>
<p>The problem is that the C standard requires the wide character set to be a fixed width encoding &#8211; which requires UCS-4/UTF-32, but nobody else uses UTF-32.</p>
<p>Worse,  C&#8217;s character handling functions (quite rightfully) only provide a small set of features &#8211; and so everybody ignores them.</p>
<p>You see, everyone has their own opinion as to which Unicode encoding to use:</p>
<ul>
<li>Unix &#8211; UTF-8 for backwards compatibility with traditional APIs</li>
<li>Windows &#8211; UTF-16 (Predates Unicode 2.0)*</li>
<li>Java &#8211; UTF-16 (Predates Unicode 2.0)</li>
<li><a title="International Components for Unicode" href="http://site.icu-project.org/">ICU</a> &#8211; UTF-16</li>
</ul>
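<p>To make the trade-off between these encodings concrete, the sketch below counts how many code units a few characters need in each one. It is written in Python for brevity; the helper <em>code_units</em> and the choice of sample characters are made up for the illustration.</p>

```python
# A, e-acute, the euro sign, and the musical G clef (outside the 16-bit range).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]


def code_units(ch):
    """Return the number of code units needed in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),           # 8-bit units
        len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    )


table = [code_units(ch) for ch in samples]
for ch, row in zip(samples, table):
    print("U+%04X" % ord(ch), row)
```

<p>Only UTF-32 uses one unit per character throughout; UTF-16 needs two units (a surrogate pair) for the last sample, which is why it does not count as a fixed width encoding.</p>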
<p>The end result is that the world&#8217;s most popular programming languages (C and C++) are left with Unicode support that is quite frankly useless.</p>
<p><strong>The Future</strong></p>
<p>C1X and C++0x save our bacon. Both add the types char16_t and char32_t, which map quite naturally on to UTF-16 and UTF-32. They also allow creating character constants in those encodings, which will vastly simplify use of them.</p>
<p>It will also make use of libraries like <a title="International Components for Unicode" href="http://site.icu-project.org/">ICU</a> feel a lot less like stabbing oneself in the eye.</p>
<p>In fact, there are only two major issues I see with C1X&#8217;s Unicode support: Firstly, there is no convertibility defined to wchar_t. Secondly, and probably as a consequence of the former, there is no support provided for outputting Unicode strings to the console. Perhaps with time this will change.</p>
<p><em>* Microsoft defines wchar_t to be 16 bits. This is definitely non-compliant, but wasn&#8217;t when they did it: at the time 16 bits covered the entirety of Unicode. Changing it now, of course, would be impossible, as pretty much the entire Windows API is built upon passing around wchar_t strings.</em></p>
<p><em>[Correction 2010-07-07: Mistake in my recollection of C++0x removed - C++0x does in fact add char16_t/char32_t.]</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.owenshepherd.net/2010/07/05/the-sorry-state-of-unicode-in-c/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

