Tracking php|tek Tweets With Twitterslurp

May 20th, 2009

For a client at work a few months ago, I created a Twitter search tool, Twitterslurp, that put all the tweets related to the client’s project on their webpage, updated in (close to) real-time via AJAX.

I’ve since added a lot of features, including a set of graphs, and we’ve set up a version of Twitterslurp for php|tek 2009.
Read the rest of this entry »

Pickens Plan Scaling Talk Slides Are Available

May 15th, 2009

On Wednesday night, I gave a talk at the DC PHP developer’s group regarding how The Bivings Group scaled the Pickens Plan website after it stopped working following a national advertising campaign after the first 2008 presidential debate drew thousands of people to the website.

The slides are now available for download.

Also, I want to thank everyone who showed up. It was great talking to all of you!

MySQL Performance Benefits of Storing Integer IP Addresses

March 9th, 2009

On ##php on freenode over the weekend, there was a brief discussion regarding the performance benefits of storing IP addresses in a database as an integer, rather than as a varchar. One commenter was making the argument that the numeric-to-string and string-to-numeric conversions cost involved in storing IPs as integers was more significant than the space savings.

While the space savings are easily apparent, and I’ve seen demonstrations of how much faster search operations are, I could not recall anyone ever doing an analysis of how long it takes to insert IPs as strings or integers into the database. Naturally, I had to determine an answer to this so I’d know for future reference.

My findings show that, beyond the disk space savings and the much faster index queries possible with integer-based IP addresses, even with the database doing the string-to-numeric conversion, it is 9-12% faster to store IP addresses into the database as integers, rather than strings.

Background

There are two ways you can store an IPv4 address in a database. In its native format, an IP address is a four-byte long integer, and can be represented as an int unsigned.

In its human-readable format, an IP address is a string, with a minimum length of 7 characters (0.0.0.0) and a maximum length of 15 (255.255.255.255). This gives it an average length (assuming uniform random distribution) of 13.28 characters. Accordingly, one would store it in a database field of type varchar(15). In order for the database to keep track of exactly how much data is in the column, an additional byte of data must be added to store the length of the string. This brings the actual data-storage costs of an IP represented as a string to an average of 14.28 bytes (assuming the characters can be represented by one byte per character, as with latin1 or utf8).

This means that storing an IP address as a string requires, on average, about 10 bytes of extra data. In an application that saves access logs, that 10 bytes of data will eventually add up, and that should be reason enough to store IPs as an integer, rather than a string.

Disk Space Is Cheap, So No Big Deal, Right?

There are other costs associated with having larger data fields. If the column is indexed, the index will be larger as well. Larger indexes tend to perform slower than smaller indexes. Additionally, while disk space is plentiful and cheap, RAM is considerably more limited, so more memory will be used to cache the data or indexes, potentially pushing other, more valuable content out of the cache.

Further, while disks have been gradually getting faster, it still takes a relatively long time to read data from (and even longer to write to) a disk, and CPUs have gotten faster much more quickly than disks. The more data that has to be moved around, the more time the CPU wastes moving that data instead of performing more interesting work.

So, What’s the Actual Cost?

I wrote a PHP script that generates and inserts 1,000,000 random IP addresses into four mysql tables and timed the results. All tests were done with PHP 5.3alpha2 and MySQL 5.0.67 on a 2.16 GHz MacBook Pro (Intel Core Duo), with MyISAM tables.

The test uses four tables as described below. Each table has three columns: an id int unsigned not null auto_increment primary key column, and ip and s as described below:

table ip s description
inetbench_long int unsigned not null char(1) not null fixed row length
inetbench_long_dynamic int unsigned not null varchar(1) not null dynamic row length
inetbench_varchar varchar(15) not null varchar(1) not null dynamic row length; IP stored as string
inetbench_varchar_utf8 varchar(15) not null varchar(1) not null same as inetbench_varchar, but charset = utf8

The PHP benchmark script generates 1 million random IP addresses, and then inserts that data into MySQL To make MySQL do as much work as possible, for the inetbench_long tables, the sql query string uses INET_ATON() to convert the IP to an integer, rather than attempting to do it in PHP.

The results:

table insert time avg row length data length index length total length
inetbench_long 132.35 sec 10 bytes 10,000,000 22,300,672 32,300,627
inetbench_long_dynamic 132.71 sec 20 bytes 20,000,000 22,300,672 42,300,627
inetbench_varchar 144.44 sec 24 bytes 24,504,148 36,341,760 60,845,908
inetbench_varchar_utf8 148.86 sec 24 bytes 24,504,148 36,341,760 60,845,908

MySQL is adding 1 byte of overhead per row to inetbench_long, 10 bytes/row to inetbench_long_dynamic, and an average of 5 bytes/row of overhead to inetbench_varchar and inetbench_varchar_utf8.

The s column was added to test the performance difference of fixed vs. dynamic row storage, but it turns out there’s not much difference.

You can download the code used to generate these results here.

Analysis

Storing IPs as a string, besides requiring more disk space, takes 9% longer than storing them as integers, even with the overhead of converting the IP from a string to an integer. If the table uses utf8 encoding, it’s 12% slower. (This should not be surprising: UTF-8 is inherently slower to process than a strictly 8-bit encoding.) Storing data as an integer in a table with a dynamic row length is not appreciably slower. The indexes on the string tables are 63% larger.

For good measure, I tested a few select queries against the tables. The results were actually somewhat interesting.

Doing a search where you’re looking for a specific IP, or a range of IPs that can be satisfied with a like clause resulted in no significant difference between the integer and varchar storage. (This was somewhat surprising to me.)


select benchmark(10000000, (select count(*) from inetbench_long where ip between inet_aton('172.0.0.0') and inet_aton('172.255.255.255')));
1 row in set (0.41 sec)

select benchmark(10000000, (select count(*) from inetbench_varchar where ip like '172.%'));
1 row in set (0.42 sec)

However, as expected, when the search range can’t be represented with a simple like query, the speed difference between numeric and string indexes really show:


select benchmark(35000000, (select count(*) from inetbench_long where ip between inet_aton('172.16.0.0') and inet_aton('172.31.255.255')));
1 row in set (1.43 sec)

select count(*) from inetbench_varchar where inet_aton(ip) between inet_aton('172.16.0.0') and inet_aton('172.31.255.255');
1 row in set (1.47 sec)

select count(*) from inetbench_varchar_utf8 where inet_aton(ip) between inet_aton('172.16.0.0') and inet_aton('172.31.255.255');
1 row in set (1.72 sec)

This results in an integer search that’s about 35 million times faster than the string search. Also, the utf8 table is about 17% slower than the latin1 table (which isn’t too surprising, given UTF-8’s overhead).

In Conclusion…

With “only” a difference of 12 microseconds per insert query, it may not make sense to change an existing database if you’re not doing many queries against stored IPs. If you’re doing IP range queries, though, you probably want to convert your tables. Any new development should be storing IP addresses in the database as integers by default. The space and time savings are worth it.

Once IPv6 becomes more prevalent, the savings will only become larger: a 128-bit (16 byte) IPv6 address can be up to 39 characters long when represented in a “human readable” format. (Storing IPv6 addresses in the database is going to be a bit more difficult, as MySQL doesn’t have a native 16-byte-wide data type.)

The load on MySQL when inserting integer IPs could likely be slightly reduced by doing that conversion in your application, rather than using MySQL’s INET_ATON() function.

Aggregate Map Tools, Part 1: GlobalMapTiles

February 26th, 2009

My fourth open source release this week comes from work I’ve done for my employer, The Bivings Group. Today, we are releasing a set of code that assists with aggregating markers on a Google Map. Our clients wanted to be able to display markers on a map reflecting the locations of people who provided their location (city, state, zip, and in some cases, street address), but with tens of thousands of expected sign-ups, it’s not feasible to display all the points on the map at once.

Read the rest of this entry »

Week of Open Source Releases: xarfs 0.2

February 25th, 2009

This week’s third piece of software is xarfs, a proof-of-concept MacFUSE-based filesystem that mounts xar archive files as a read-only filesystem.

I wrote xarfs because I wanted an easy way to be able to inspect Mac OS X Installer package files, which as of Mac OS X 10.5, began to be xar archives. In particular, I wanted to be able to lsbom the package manifest in order to see what files were being modified. (In retrospect, it probably would have been easier, though less satisfying, to have written a simple tool that extracts only the package manifest and lists its contents.)

Although xarfs provides easy access to view the contents of a xar archive, it has some pretty significant limitations. Besides having a rather clumsy user interface, and presenting only a read-only interface to xar files, it handles large files spectacularly poorly, decompressing them entirely into memory regardless of how much of the file is being accessed. (This is a limitation in the xar API.) It’s also written in Objective C using MacFUSE’s framework. This rather limits its portability; an implementation using the regular FUSE API would probably allow xarfs to work on Linux.

If anyone’s interested in this, it’s probably possible to resolve some of its deficiencies and make it a lot more useful. Let me know if you find it helpful.

xarfs is licensed under the three-clause BSD license. For more information, or to download, see the xarfs page.

Week of Open Source Releases: slow 1.0

February 24th, 2009

The second piece of software I’m releasing this week is a unix command line tool called slow. Slow is a tool that makes it easy to cause another process to run slowly.

Slow does this by taking advantage of two signals, SIGSTOP and SIGCONT, which respectively stop and re-start a unix process’ execution.

I’ve used this tool to prevent long-running cpu-bound processes from consuming an entire cpu to improve performance of other processes on the machine. I’ve also used it to slow part of a two-part data import process where temporary data files could be created much faster than they could be processed and removed, running the risk of low disk space conditions.

Slow is released under the GPL v3. For documentation, more information, and download links for slow and its source code, please see slow’s web page.

Safari 4 beta and WebKit release analysis

February 24th, 2009

Earlier today, Apple released a beta version of Safari 4, with the version number 528.16.

As is common knowledge, Safari is based on the open-source WebKit rendering engine, which releases nightly builds. For reference, the current WebKit version is 520.1+.

Because WebKit is open source, using WebKit’s version history tracker, it’s possible to determine the following interesting information:

* The Safari-6528 branch was created on January 29th based on the state of the WebKit trunk at r40289, from January 27th.

* There have been 75 merges of code (presumably bug fixes) into the 6528 branch since then. The last of these merges was r41097 (merging r41091) on February 19.

* The release is tagged in WebKit svn as r41182, the/releases/Apple/Safari 4 Public Beta tag.

This means that the Safari 4 beta is based roughly on the state of WebKit as it was at the end of January, one month behind current development. Considering the importance of WebKit to Safari in general, and Mac OS X as a whole, a one month delay for QA purposes is probably pretty good.

Kudos to all of the people working on WebKit! Keep up the good work!

Week of Open Source Releases: Unlockupd 1.0.2

February 23rd, 2009

The first piece of software I’m releasing this week is an updated version of Unlockupd. Unlockupd is a Mac OS X daemon that works around a bug in lookupd, a system service which is required for proper operation of Mac OS X 10.0 – 10.4. If lookupd fails, the system quickly becomes unusable. Unlockupd periodically checks lookupd’s status and forces it to restart should it fail.

Fortunately, the bug has been fixed in Mac OS X 10.5, which does not have a lookupd service, but it was present in Mac OS X 10.3 and 10.4.

Unlockupd was first released in 2004, and it really should have been released under an open source back then. Better late than never, though. At this point, it’s really more of an historical relic than a useful program, but hopefully someone will find some use with the code.

More information is available on Unlockupd’s web page, including download links for Unlockupd and its source code. Unlockupd is licensed under the GPL v3.

New Blog

February 22nd, 2009

Hi! I’m John Bafford, and welcome to my website.

Long overdue, this new website and blog are now live. I’ll be talking mostly about PHP and related technology — I’ve spent the last almost ten years writing PHP (and JavaScript) code, as well as a smattering of Perl, Python, and other languages, after many prior years writing in C/C++. Expect some Mac talk as well, on occasion.

To “celebrate” the new website, throughout this week, I’ll be doing a number of (in some cases, very belated) open source releases of some software I’ve worked on over the years. There will be a somewhat Mac focus to some of these releases, but even it you don’t have a Mac, hopefully, you’ll still find them interesting enough to follow along, and I’ll get to the PHP-related content soon enough.

If you have questions or comments, feel free to comment here, or send me an email from the feedback form via the button at the bottom of the page.

WordPress Comments Vote Plugin Released

January 9th, 2009

WordPress Comments Vote plugin released. More details on The Bivings Report, my employer’s blog.