Jonesy's blog feed
Cloud computing hype overload
I’ve been working with what I used to call “utility computing” tools for about 6-9 months. However, for about the past 2 months, I’ve been seeing the term “cloud computing” all over the place, and there is so much buzz surrounding it that it’s reaching that magical point best described using Alan Greenspan’s words: “Irrational Exuberance”.
When Alan Greenspan used those words to describe the attitudes of investors toward the markets, what he was basically saying was that there were people who didn’t really know what they were doing, putting more money than they ought into things they knew relatively little about. Further, he was saying that the decisions people were making with regard to where to put their money were a) bad, or at least b) not based on sound reasoning, or the ‘facts on the ground’.
This, I think, is where we are at with “cloud computing”. The blog post that put me over the edge is this one, for the record. I read Sean’s writings often enough, but this one strikes me as being a little off, a little sensationalistic, not based in reality, and a little misleading.
Maybe he just didn’t put enough qualifiers in there. His post might make more sense if he limited its scope and provided more facts, but I guess it’s just an opinion piece so he decided not to go that route, and that’s his prerogative I guess.
By limiting the scope, I mean he should’ve realized that there are millions of web sites currently scaling quite nicely without the use of cloud computing. In addition, some of the new ones that are having issues are also not using cloud computing, and when they hit bumps in the road, they make it through, and the great thing is that they also share their stories, and those stories indicate that a cloud (or, the current cloud offerings) wouldn’t have helped much (there’s lots of other evidence of that too). What would’ve helped is if they had paid more attention to:
- monitoring
- initial infrastructure design
- their own app code and app design
This is how I want all project web sites to look…
My brain has a set of rules that software project websites get tested against. Each time a project site fails to comply with a rule, I get ever-so-slightly more annoyed, and ever-so-slightly less likely to use the software in question (if there are alternatives, this is even maybe not so “slightly”).
I thought I’d list these rules because I suspect others are like me: we’re extremely busy, we work too many hours, and are involved with too many projects to spend hours trying to figure out what some piece of code someone mentioned once in IRC actually does.
But first, know that this site actually complies with just about every single rule there is, so it’s a great template to work from if your site needs brushing up.
- First and foremost, tell me, right away, what this thing does, the problem it solves, and (at a high level) the approach taken to solve the problem.
- Tell me the language it’s written in. If it’s open source, and it’s written in a language I hack in, *and* it solves a problem I need solved, maybe I can help out, or be encouraged that if something flakes, I can fix it, or at least speak the developer’s language if I have to describe the issue to the folks upstream.
- Tell me what OS is required, and preferably what OS/version is tested with.
- Give me a full list of dependencies with links to go get them, or give me a link to “Dependencies”, or to an install document that lists them.
- Tell me the current version, and the date it was released. Beta versions and dates are nice too. If there is a timed release schedule, tell me that.
- Keep the information up-to-date. I shouldn’t have to wonder if your software is going to work under OS X 10.5 or RHEL 5, or if your plugin will work under the latest version of Drupal/Django/Moodle/MySQL/Joomla/Firefox…
- BONUS: a very simple architectural drawing that shows me exactly what components make up the whole. The one for CouchDB is as good as any I’ve ever seen (assuming it’s accurate).
- BONUS: if screenshots are applicable, use them. They communicate a million times more information using a million times less real estate and bandwidth. They can communicate things you didn’t even know you were communicating. Of course, that could be good or bad, but it keeps you honest, and customers like that :-)
- DON’T require me to understand how something like Trac or some other tool works in order to get at the information about your software project. Navigation should not assume I’m a developer, it should assume I’m a prospective user who will leave if they can’t read the menu. If you want to use a project management tool to do your work, more power to you, but as a prospective customer, it’s none of my business — don’t drag me into your personal hell! I just want the software!
- DON’T be satisfied with the Sourceforge page as your project’s “homepage”. The problem with doing that is twofold: first, Sourceforge kinda sucks, and occasionally becomes unusable. Second, it doesn’t provide a simple way for you to give me information, nor a simple way for me to find it even if you produce said information using their tools. Also, it’s bad form. If you haven’t committed to the project enough to give it a proper site, well…
- DON’T put some kind of “Coming Soon” page with a bunch of information with *NO DATE*, because I’m going to go ahead and assume that this thing is vaporware, and that the “coming soon” post is 3 years old. Nothing in this world is more annoying than time-sensitive information being plastered on a web site with no date.
- DO NOT — I repeat — DO NOT force me to download a 20MB tarball to get at the documentation. That’s not how things work. I get to see what I’m downloading *before* I download it. You’ll save me some time, and save yourself some bandwidth, and you’ll have more accurate statistics about how many people download and use your software, because the numbers won’t be skewed by folks who were forced to download the package to get at the documentation.
Plug-ins: isn’t there a better way?
If there’s one thing that bothers me about using a ready-made solution like WordPress for my blog, it’s plug-ins. I hate software plug-ins. The first question every support engineer for any software product that supports plugins asks in response to a trouble report is “are you using any plugins?” And when you say “yep, I’m using plugins!” the reply from support is to disable them immediately and see if the trouble goes away. That’s a problem.
What’s worse, if the plugins are maintained by a third party (often the case), there’s no telling whether or not they’ll exist when the next version of the base software is released, or whether they’ll be supported in future versions of the software.
Two examples that touch my daily life are Firefox and WordPress.
Lately (since around March) I’ve been having lots of trouble with Firefox. I thought upgrading to Firefox 3 would’ve helped, but it really didn’t. Running it on OS X, Firefox hangs frequently enough that I’m actually considering using Safari (I do NOT like Safari). Know what happened right around that time? Ah - I found the Firefox plugins for managing EC2 and S3. So today I’ll uninstall those and see if it helps.
With WordPress, there are two things I’m missing: I need to let readers subscribe to comments via email, and I need better Google AdSense for Search integration with WordPress. Both things are kinda maybe supported in one version or another “but should work under…” - whatever. I don’t really want to spend my time downloading, reading the documentation to do the install, doing the install and configuration, etc., and then finding out that it doesn’t work, or worse, having it look on the surface like it works, but then finding later that it fails in evil-but-silent ways.
These two products are by no means exceptions. Moodle, PHP-Nuke, XOOPS, MediaWiki, Twiki, Postnuke… and for that matter, OpenLDAP, BIND, SSH, MySQL, Sendmail, PAM… all have plugins available written by other folks, and all have bitten me at one point or another. Usually when it comes time to upgrade the base software.
I’m not saying anything new here. People have had this problem with lots of different software products for a long time. My question is “why is this still a problem?” I’m not asking this because I have some magical obvious solution or answer, I’m asking because I feel like there’s probably more to it than I’m grasping. I’m not a masterful developer, or even a masterful software project manager, so I’m calling on all of you who are (or are closer than I am) to help me understand the problem. Some day, I might find myself in a position to take the wrong or right path where plug-ins are concerned, and I’d like to be more informed than I am so I can avoid putting users in the position I find myself in when I use other peoples’ software. Has Joel blogged this yet? If so, I can’t find it. Links please?
PyWorks 2008, November 12-14, Call for Papers Open!
Yes, the same folks who bring you Python Magazine and php|architect magazine (and several other things, like online training, a full line of books, and more conferences), are hosting our first ever Python conference! You can see more about it, and the Call for Papers, at the conference site.
The hotel which once hosted php|works in Atlanta is actually large enough to host both php|works *and* PyWorks in the same venue, so this year the two conferences will be held at the same time, and the plan/hope is to be able to have talks that are generic enough to be of use to either audience, like talks about scaling MySQL, or SVN management, or Hadoop, or Amazon Web Services, or something like that. In any case, the attendees of either conference will be allowed to cross over to attend talks at the other if they so choose, which I think is pretty cool. Maybe we can add more languages in future years and just have it be called “LANG ’08” or something. See the PyWorks Call for Papers if you have ideas!
I’m really excited to get down to Atlanta and meet the guys I’ve interacted with down there from the Python community, including Doug Hellmann, whom I’ve worked pretty closely with over the past several months. There’s also a thriving Python community in that area, I hear. Looking forward to it!
Why should I pay for this AWS design decision?
I was writing a utility in Python (using boto) to test/play with Amazon’s SQS service. As boto isn’t particularly well documented where SQS specifically is concerned, I also plan to post some examples (either here or on Linuxlaboratory.org, or both). When I had some trouble getting a message that was sent to a queue, I went to the Amazon documentation, and found this little gem in the Amazon Web Services FAQ:
I am sure that my queue has messages, but a call to ReceiveMessage returned none. What could be the problem?
Due to the distributed nature of the queue, a weighted random set of machines is sampled on a ReceiveMessage call. That means only the messages on the sampled machines are returned. If the number of messages in the queue is small (less than 1000), it is likely you will get fewer messages than you requested. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response. Your application should be prepared to poll the queue until a message is received. Note that with the 2008-01-01 version of Amazon SQS, you’re charged for each request you make, so set your polling frequency with that in mind.
So… if you were planning to decouple application components with SQS under an ‘eventual consistency’ model, keep in mind that Amazon is using the same model, and that they’re charging you for the privilege of eventually getting messages you’ve already paid to put there, but which aren’t necessarily available at any given point in time. I personally think this is a little goofy, and wrong.
If I put a message in a queue, I should be charged for actually getting the message. I should *not* be charged for checking to see if Amazon’s internal workings have made my messages available to me yet.
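Here is a minimal sketch of the “poll until a message is received” pattern the FAQ describes, with a counter so you can see how many billable requests it took. Note that `receive_once` is a hypothetical stand-in for whatever call actually fetches messages (it is not boto’s API), and the delay and backoff values are arbitrary assumptions.

```python
import time


def poll_until_message(receive_once, max_requests=20, delay=0.5,
                       backoff=2.0, max_delay=30.0, sleep=time.sleep):
    """Poll a queue until a message arrives or we give up.

    receive_once is a hypothetical callable that returns a (possibly
    empty) list of messages; it stands in for one ReceiveMessage request.
    Returns (messages, requests_made) so the caller can see how many
    charged requests were needed.
    """
    for attempt in range(1, max_requests + 1):
        messages = receive_once()      # every call counts as one request
        if messages:
            return messages, attempt
        if attempt < max_requests:
            sleep(delay)               # wait before asking again
            delay = min(delay * backoff, max_delay)
    return [], max_requests


if __name__ == "__main__":
    # Fake queue: two empty responses, then the message finally shows up.
    responses = iter([[], [], ["hello"]])
    print(poll_until_message(lambda: next(responses), sleep=lambda s: None))
```

Backing off between empty responses keeps the request count (and the bill) down, at the cost of noticing new messages a little later.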
A Couple of MySQL Performance Tips
If you’re an advanced MySQL person, you might already know these, in which case, please read anyway, because I still have some questions. On the other hand, if you’re someone who launched an application without a lot of database background, thinking “MySQL Just Works”, you’ll eventually figure out that it doesn’t, and in that case, maybe these tips will be of some use. Note that I’m speaking specifically about InnoDB and MyISAM, since this is where most of my experience is. Feel free to add more to this content in the comment area.
InnoDB vs. MyISAM
Which one to use really depends on the application, how you’re deploying MySQL, your plans for growth, and several other things. The very high-level general rule you’ll see touted on the internet is “lots of reads, use MyISAM; lots of writes, use InnoDB”, but this is really an oversimplification. Know your application, and know your data. If all of your writes are *inserts* (as opposed to updates or deletes), MyISAM allows for concurrent inserts, so if you’re already using MyISAM and 90% of your writes are inserts, it’s not necessarily true that InnoDB will be a big win, even if those inserts make up 50% of the database activity.
In reality, even knowing your application and your data isn’t enough. You also need to know your system, and how MySQL (and its various engines) use your system’s resources. If you’re using MyISAM, and you’re starting to be squeezed for disk space, I would not recommend moving to InnoDB. InnoDB will tend to take up more space on disk for the same database. If you’re squeezed for RAM, I would also not move to InnoDB, because, while clustered indexes are a big win for a lot of application scenarios, they cause data to be stored along with the index, so the index takes up more space in RAM (when it is being cached in RAM).
In short, there are a lot of things to consider before making the final decision. Don’t look to benchmarks for much in the way of help — they’re performed in “lab” environments and do not necessarily model the real world, and almost certainly aren’t modeled after your specific application. That said, reading about benchmarks and what might cause one engine to perform better than another given a certain set of circumstances is a great way to learn, in a generic sort of way, about the engines.
Indexing
Indexes are strongly tied to performance. The wrong indexing strategy can cause straight selects on tables with relatively few rows to take an inordinately long amount of time to complete. The right indexing strategy can help you keep your application ‘up to speed’ even as data grows. But there’s a lot more to the story, and blind navigation through the maze of options when it comes to indexing is likely to result in poorer performance, not better. For example, indexing all of the columns in a table in various schemes all at once is likely to hurt overall performance, but at the same time, depending on the application needs, the size of the table, and the operations that need to be performed on it, there could be an argument for doing just that!
You should know that indexes (at least in MySQL) come in two main flavors: clustered, and non-clustered (there are other attributes like ‘hashed’, etc that can be applied to indexes, but let’s keep it simple for now). MyISAM uses non-clustered indexes. This can be good or bad depending on your needs. InnoDB uses clustered indexes, which can also be good or bad depending on your needs.
Non-clustered indexing generally means that the index consists of a key, and a pointer to the data the key represents. I said “generally” - I don’t know the really low-level details of how MySQL deals with its non-clustered indexes, but everything I’ve read leads me to believe it’s not much different from Sybase and MSSQL, which do essentially the same thing. The result of this setup is that doing a query based on an index is still a two-step operation for the database engine: it has to scan the index for the values in the index, and then grab the pointer to get at the data the key represents. If that data is being grabbed from disk (as opposed to memory), then the disk seeks will fall into the category of “random I/O”. In other words, even though the index values are stored in order, the data on disk probably is not. The disk head has to go running around like a chicken without a head trying to grab all of the data.
Clustered indexes, by comparison, kinda rock. Different products do it differently, but the general idea is that the index and the data are stored together, and in order. The good news here is that all of that random I/O you had to go through for sequential range values of the index goes away, because the data is right there, and in the order dictated by the index. Another big win here which can be really dramatic (in my experience) is if you have an index-covered query (a query that can be completely satisfied by data in the index). This results in virtually no I/O, and extremely fast queries, even on tables with a million rows or more. The price you pay for this benefit, though, can be large, depending on your system configuration: in order to keep all of that data together in the index, more memory is required. Since InnoDB uses clustered indexes, and MyISAM doesn’t, this is what most people cite as the reason for InnoDB’s larger memory footprint. I haven’t seen anything else to attribute it to myself. Thoughts welcome.
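To make the difference concrete, here is a toy Python model of the two layouts. It is only an illustration of the general idea: the class names and the “positions touched” bookkeeping are made up for this sketch and say nothing about how MyISAM or InnoDB actually lay out pages on disk.

```python
import bisect


class NonClusteredTable:
    """Rows sit in arrival order; the index holds (key, row position)."""

    def __init__(self, rows):
        self.heap = list(rows)  # data in the order it was inserted
        self.index = sorted((key, pos) for pos, (key, _) in enumerate(self.heap))

    def range_query(self, low, high):
        # Step 1: find the first index entry with key >= low.
        start = bisect.bisect_left(self.index, (low, -1))
        results, positions = [], []
        for key, pos in self.index[start:]:
            if key > high:
                break
            positions.append(pos)           # Step 2: follow the pointer
            results.append(self.heap[pos])
        return results, positions


class ClusteredTable:
    """Rows are stored in key order, so index order is data order."""

    def __init__(self, rows):
        self.rows = sorted(rows)
        self.keys = [key for key, _ in self.rows]

    def range_query(self, low, high):
        start = bisect.bisect_left(self.keys, low)
        end = bisect.bisect_right(self.keys, high)
        return self.rows[start:end], list(range(start, end))


if __name__ == "__main__":
    rows = [(5, "e"), (1, "a"), (4, "d"), (2, "b"), (3, "c")]
    print(NonClusteredTable(rows).range_query(2, 4))  # positions jump around
    print(ClusteredTable(rows).range_query(2, 4))     # positions are contiguous
```

Both tables return the same rows for the range; the non-clustered one touches scattered positions (the “random I/O” case), while the clustered one reads one contiguous slice.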
Indexes can be tricky, and for some, it looks like a black art. While I am a fan of touting proper data schema design, and that data wants to be organized independently of the application(s) it serves, I think that once you get to indexing, it is imperative to understand how the application(s) use the data and interact with the database. There isn’t some generic set of rules for indexing that will result in good performance regardless of the application. You also don’t have data integrity issues to concern yourself with when developing an index strategy. One question that arises often enough to warrant further discussion is “hey, this column is indexed, and I’m querying on that column, so why isn’t the index being used?”
The answer is diversity. If you’re running one of those crazy high performance web 2.0 behemoth web sites, one thing you’ve no doubt tossed around is the idea of sharding your data. This means that, instead of having a table with 400,000,000 rows on one server, you’re going to break up that data along some kind of logical demarcation point in the data to make it smaller so it can be more easily spread across multiple servers. In doing so, you might create 100 tables with 4,000,000 rows apiece. However, a common problem with figuring out how to shard the data deals with “hot spots”. For example, if you run Flickr, and your 400,000,000 row table maps user IDs to the locations of their photos, and you break up the data by user ID (maybe a “user_1000-2000” for users with IDs between 1000 and 2000), then that can cause your tables to contain far less diverse data than you had before, and could potentially cause *worse* performance than you had before. I’ve tested this lots of times, and found that MySQL tends to make the right call in these cases. Perhaps it’s a bit counterintuitive, but if you test it, you’ll find the same thing.
For example, say that user 1000 has 400,000 photos (and therefore, 400,000 rows in the user_1000-2000 table), and the entire table contains a total of 1,000,000 rows. That means that user 1000 makes up 40% of the rows in the table. What should MySQL do? Should it perform 400,000 multi-step “find the index value, get the pointer to the data, go get the data” operations, or should it just perform a single pass over the whole table? At some point there must be a threshold at which performing a table scan becomes more efficient than using the index, and the MyISAM engine seems to set this threshold at around 30-35%. This doesn’t mean you made a huge mistake sharding your data — it just means you can’t assume that a simple index on ‘userID’ that worked in the larger table is going to suffice in the smaller one.
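The arithmetic in that example is easy to sketch in Python. The function name is made up for illustration, and the 30% default is only the rough observation quoted above, not a documented MySQL constant; the real optimizer’s decision depends on much more than a single ratio.

```python
def likely_access_path(matching_rows, total_rows, scan_threshold=0.30):
    """Guess whether an index lookup or a table scan is cheaper.

    scan_threshold is an assumption taken from the rough 30-35% figure
    observed in testing; it is not an official MySQL setting.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    fraction = matching_rows / total_rows
    return "table scan" if fraction >= scan_threshold else "index"


if __name__ == "__main__":
    # User 1000 in the sharded table: 400,000 of 1,000,000 rows (40%).
    print(likely_access_path(400_000, 1_000_000))
    # The same user in the original 400,000,000-row table (0.1%).
    print(likely_access_path(400_000, 400_000_000))
```

The same user who was a drop in the bucket in the big table becomes a large slice of the shard, which is why an index that was fine before sharding may stop being used afterwards.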
But what if there just isn’t much diversity to be had? Well, perhaps clustered indexing can help you, then. If you switch engines to InnoDB, it’ll use a clustered index for the primary key index, and depending on what that index consists of, and how that matches up with your queries, you may find a solution there. What I’ve found in my testing is that, presumably due to the fact that data is stored, in order, along with the index, the “table scan” threshold is much higher, because the number of IO operations MySQL has to perform to get at the actual data is lower. If you have index-covered queries that are covered by the primary key index, they should be blazing fast, where in MyISAM you’d be doing a table scan and lots of random I/O.
For the record, and I’m still investigating why this is, I’ve also personally found that secondary indexes in InnoDB seem to be faster than those in MyISAM, though I don’t believe there’s much in the way of an advertised reason why this might be. Input?
Joins, and Denormalization
For some time, I read about how sites like LiveJournal, Flickr, and lots of other sites dealt with scaling MySQL with my head turned sideways. “Denormalize?! Why would you do that?!” Sure enough, though, the call from on high at all of the conferences by all of the speakers seemed to be to denormalize your data to dramatically improve performance. This completely baffled me.
Then I learned how MySQL does joins. There’s no magic. There’s no crazy hashing scheme or merging sequence going on here. It is, as I understand it (I haven’t read the source), a nested loop. After learning this, and twisting some of my own data around and performing literally hundreds, if not thousands of test queries (I really enjoy devising test queries), I cringed, recoiled, popped up and down from the ceiling to the floor a couple of times like that Jekyll and Hyde cartoon, and started (carefully, very carefully) denormalizing the data.
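For anyone who hasn’t seen one spelled out, a naive nested loop join looks roughly like this in Python. This is a generic textbook sketch with no index on the inner table, which is the worst case; it is not a transcription of what MySQL does internally, and the table and column names are invented for the example.

```python
def nested_loop_join(outer, inner, outer_key, inner_key):
    """Join two lists of dicts by comparing every outer row to every inner row.

    Returns (joined_rows, comparisons) so the cost is visible: with no
    index on the inner side it is len(outer) * len(inner) comparisons.
    """
    joined = []
    comparisons = 0
    for outer_row in outer:
        for inner_row in inner:
            comparisons += 1
            if outer_row[outer_key] == inner_row[inner_key]:
                joined.append({**outer_row, **inner_row})
    return joined, comparisons


if __name__ == "__main__":
    users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
    photos = [
        {"owner_id": 1, "path": "x"},
        {"owner_id": 2, "path": "y"},
        {"owner_id": 1, "path": "z"},
    ]
    rows, cost = nested_loop_join(users, photos, "user_id", "owner_id")
    print(len(rows), "rows joined using", cost, "comparisons")
```

Each extra table in the join adds another loop level, so the work multiplies. A denormalized table that already carries the user’s name on each photo row skips the inner loop entirely, which is the appeal (and the danger) of denormalizing.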
I cannot stress how carefully this needs to be done. It may not be completely obvious which data should be denormalized/duplicated/whatever. Take your time. There are countless references for how to normalize your data, but not a single one that’ll tell you “the right way” to denormalize, because denormalization itself is not considered by any database theorists to be “right”. Ever. In fact, I have read some great theorists, and they will admit that, in practice, there is room for “lack of normalization”, but they just mean that if you only normalize to 3NF (3rd Normal Form), that suits many applications’ needs. They do *NOT* mean “it’s ok to take a decently normalized database and denormalize it”. To them, normalization is a one way street. You get more normalized - never less. These theorists typically do not run highly scalable web sites. They seem to talk mostly in the context of reporting on internal departmental data sets with a predictable and relatively slow growth rate, with relatively small amounts of data. They do not talk about 10GB tables containing tens or hundreds of millions of rows, growing at a rate of 3-500,000 rows per day. For that, there is only anecdotal evidence that solutions work, and tribal war stories about what doesn’t work.
My advice? If you cannot prove that removing a join results in a dramatic improvement in performance, I’d rather perform the join if it means my data is relatively normalized. Denormalization may appear to be something that “the big boys” at those fancy startups are doing, but keep in mind that they’re doing lots of stuff they’d rather not do, and probably wouldn’t do, if they had the option (and if MySQL didn’t have O(n^2), O(n^3), or similar performance with regard to joins).
Do You Have an IO Bottleneck?
This is usually pretty easy to determine if you’re on a UNIX-like system. Most UNIX-like systems come with an ‘iostat’ command, or have one readily available. Different UNIX variants show different ‘iostat’ output, but the basic data is the same, and the number you’re looking for is “iowait” or “%iowait”. On Linux systems, you can run ‘iostat -cx 2’ and that’ll print out, every 2 seconds, the numbers you’re looking for. Basically, %iowait is the percentage of time (over the course of the last 2-second interval) that the CPU had to hang around waiting for I/O to complete so it would have data to work with. Get a read of what this number looks like when there’s nothing special going on. Then take a look at it on a moderately loaded server. Use these numbers to gauge when you might have a problem. For example, if %iowait never gets above 5% on a moderately loaded server, then 25% might raise an eyebrow. I don’t personally like when those numbers go into double-digits, but I’ve seen %iowait on a heavily loaded server get as high as 98%!
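If you’d rather compute the number yourself on Linux, here is a rough Python sketch that derives an iowait percentage from two samples of the aggregate `cpu` line in `/proc/stat`. The function names are made up for this example, it is Linux-only, and it is an approximation of what iostat reports rather than a reimplementation of it.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat into a list of ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Counters, in order: user, nice, system, idle, iowait, irq, softirq, steal
    return [int(value) for value in fields[1:9]]


def iowait_percent(before_line, after_line):
    """Percentage of CPU time spent in iowait between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()


def sample_iowait(interval=2.0):
    """Measure iowait over one interval, like one row of periodic output."""
    before = read_cpu_line()
    time.sleep(interval)
    return iowait_percent(before, read_cpu_line())
```

Calling `sample_iowait()` in a loop gives you a number you can log or graph, which makes it easier to learn what “normal” looks like on a given server before something goes wrong.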
Ok, time for bed
I find database-related things to be really a lot of fun. Developing interesting queries that do interesting things with data is to me what crossword puzzles are to some people: a fun brain exercise, often done with coffee. Performance tuning at the query level, database server level, and even OS level, satisfies my need to occasionally get into the nitty-gritty details of how things work. I kept this information purposely kind of vague to focus on high-level concepts with little interruption for drawn out examples, but if you’re reading this and have good examples that support or refute anything here, I’m certainly not above being wrong, so please do leave your comments below!
rrdpy - Thanks, Corey!
I have a somewhat unique situation to deal with in terms of monitoring. I need to graph a bunch of historical data mined from web server logs. I can get so far with loghetti, which is coming along and is great for certain things, but there’s a bridge missing between it and something like MRTG. I’m pretty sure that with a custom output plugin and Corey’s rrdpy, I can make it the rest of the way. In fact, I had been poring over the documentation for RRDTool and the various language bindings figuring out which way to go first just as this module was released.
I still may decide to graph the historical data using some generic ‘feed this your data in one big heap and this will chart it’ type of thing, but even that may be possible here, since rrdpy includes rrd_make.py, which may or may not (I haven’t looked yet) support the requisite arguments you need to pass to get historical data to work (I think you need to support a start time, for example).
Ubuntu 8.04 and Python Editors
So I updated one of my laptops to Ubuntu 8.04 pretty much as soon as it was available. I’ve been using my MacBook Pro laptop for everything for probably over a year now, because I grew tired of the hobby that *is* running Linux on a laptop and getting everything to work. I’ll note that I *do* run Linux on every server I maintain that I can think of.
The first test for this laptop was wireless. I bought this laptop (Lenovo T61) specifically because it got rave reviews for its Linux compatibility. I was careful to order the laptop with the proper video and wireless chipsets that had the best support. However, 2 things annoyed me so much that I went back to the MacBook for everything:
- Wireless hung, and hung often, and in a way that it was unrecoverable.
- Lenovo put the Escape key in the worst place they could possibly put it, especially for a Vim user. Changing the key mapping caused issues with other apps, and configuring the key mapping inside .vimrc doesn’t help on the 30 other servers I use it on (ssh’d in from this laptop) :-/
Really, it was the wireless that did it. I work 100% remotely on everything I do. So, 8.04 seems to have fixed the wireless issues. The next thing I wanted to do was check out all of the Python IDE/editors I couldn’t use on the Mac (or, not easily). So I used Synaptic Package Manager to install all of the ones I could find. I’m sorry to say that I personally had problems with most of them:
- DrPython launched fine, but using the file browser to open a file resulted in…. a no-op. I’m sorry, but an editor needs to be able to open a file.
- PyPE failed to launch altogether! It looks like it’s going to open, it spins for 5-10 seconds, and then just disappears. No window is ever shown, but a tab does appear in the bar on the bottom of the screen.
- Pida allows you to choose an external editor, so I chose Vim, and that kinda worked, but I really just want the key bindings, not the whole editor, and there’s no option to use some default built-in editor that has code folding and autocompletion and stuff. It appeared to me to be so close to gvim that I decided to skip it. I tried to stick around and give it a chance by reading the docs, but alas, the only thing under “Help” is “About”. Seems there are still a number of open source developers more concerned with getting credit than getting users.
- Stani’s Python Editor looked pretty nice, but I couldn’t find any easy way to change the syntax coloring, and while there is a manual, you have to donate to get your hands on it. This is nonsense. If you want to sell some kind of advanced documentation, fine, but you can’t expect me to donate to a project that I don’t even know if I want to use yet! “Please pay me so that you can see if this product fits your needs”…. it just doesn’t work that way. What you’ve done is given me a product that is complex enough that you pretty much need a manual just to get started, and then deprived me of that. Why not just give me the manual and a 30 day trial, after which I have to donate? I’d have no problem doing that if I planned on keeping it around. In fact, that’s how plenty of Mac applications work. I’ll pay for software that does what I need, but this game that’s being played is just offensive.
- Eclipse with PyDev: I can use this, but I don’t like it a whole lot. The good news is that there’s an SVN plugin (subclipse), and a plugin for vi keybindings if I want to pay for it (it’s only $20 - not bad if you use it a lot). The interface is a little clunky to me, and there’s no easy “change your syntax color scheme to this” type functionality. If you want a dark background and light colored text, you actually have to go to one place to change the background color, the color of the line numbering area, etc., and then go to another place to change the colors associated with the different elements of your particular language. That’s annoying for two reasons: first, it’ll take forever to get things the way I want, and second, if I installed this on another machine, I couldn’t just move over some kind of theme file and have my settings ready to go (as far as I know).
In the end, it looks like my three favorite editors are still Komodo Editor, JEdit, and Vim. What’s your favorite Python editor for Linux?
Python Magazine April Issue is Out
Hi all,
This month’s Python Magazine has been released, containing a few really great articles, including one about using the Google API and Google Spreadsheets to create a database “in the cloud”. For you scientist types, there’s also an excellent article about BioPython. For XCode users, there’s an article about scripting XCode with Python, and there’s also some in-depth coverage of PyTables, which I thought was really interesting. Have a look, and enjoy!
Spring Means Blooming Flowers… and Ideas
I seem to have found a pattern in my own internal workings. In the fall, I work furiously and get a lot done. Around the time of the winter holidays, I almost always do major personal web site changes and upgrades according to a mental list I’ve compiled over the previous year.
In the spring, I shake off the winter (I’m not a fan of winter), I brew my first batch of beer for the season (which symbolizes the end of winter, because I brew outdoors), and my brain starts to be flooded with new ideas. They range from the simplistic (maybe we should consider replacing windows in the house this year), to the slightly odd (why isn’t there a bluetooth setup that pairs two devices and alerts you if they get out of range, so if my daughter strays too far…), to the really useful (I should really take on that woodworking project to build that bookcase we desperately need), to the GEEKY!
This year I seem to be having a lot of geeky ideas. The difference is that, this year, I finally feel empowered enough to go after some of them. One idea that has come up is building an online brewer’s workshop. I would just build a GUI to do this for myself, but then I’d have to deal with which widget set to use, which platforms to support, and whatever else. Also, the final step in the evolution of a lot of GUIs is webification anyway. So I *think* this might be a job for Python, and I *think* I might try to do this using Django, which is fully supported by my web host (finally - see yesterday’s post)!
Brewing is one of those things that you can make as complex as you care to get. I started brewing with a buddy using a Coleman picnic cooler, a few buckets, and some odds and ends from the kitchen. Now I have a full three keg system, with pumps, plate chillers (small plate heat exchangers), fancy false bottoms, cool valves and tubing, and it involves relatively little manual labor. And that complexity can infect recipe development as well. Hops add bitterness by leaching alpha acids into the wort (the liquid that is not yet beer). Hop utilization calculations can be non-trivial and depend on many other factors in your system. Other characteristics depend heavily on the percent of available sugars you’re able to extract from the grains, and on your ability to keep a mash at a given temperature for a fixed period of time. This is easier to predict if you know, for example, the thermal mass of the vessels involved, and how much heat will be lost when you combine water and grain and stir. There are also proteins at work in the mash which can gum things up enough to make draining the liquid off a chore, so knowing what water/grain ratio to use is also important. And how quickly can you bring wort from boiling down to a temperature more friendly to yeast at the end of the cycle?
That’s a small fraction of the considerations you *could* make when brewing. I didn’t even touch on pH and water characteristics, or yeast attenuation! Needless to say, brewing with any consistency would be a great challenge and take a good bit more preparation without some tool to help you figure out how much water you’ll need, how many ounces of hops for how long, and how much grain you need to mash (and for how long), etc. There are lots of tools to help brewers out with this kind of stuff (ProMash is a popular one). The problem I have is that these tools are mostly commercial, proprietary, platform-specific ventures. I’d like to put one on the web that is at least “good enough”, and free for anyone to use. I’m open source that way (I’m happy to release the source as well).
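To make that kind of arithmetic a little more concrete, here is a minimal Python sketch of two such calculations: a bitterness (IBU) estimate and a strike water temperature. It assumes the Tinseth hop utilization approximation and the simple infusion formula found in common homebrewing references. Neither is named in this post, and the function names are made up for illustration. The strike water formula deliberately ignores the thermal mass of the vessel, which matters on a real system.

```python
import math


def tinseth_utilization(boil_gravity, boil_minutes):
    """Fraction of alpha acids that end up in the wort (Tinseth approximation)."""
    # Denser wort extracts less bitterness.
    bigness = 1.65 * 0.000125 ** (boil_gravity - 1.0)
    # Utilization rises with boil time and levels off.
    boil_time = (1.0 - math.exp(-0.04 * boil_minutes)) / 4.15
    return bigness * boil_time


def estimate_ibu(alpha_acid_pct, hop_ounces, batch_gallons,
                 boil_gravity, boil_minutes):
    """Estimated IBUs for a single hop addition (US gallons, ounces)."""
    # 7490 converts ounces per gallon to milligrams per liter.
    mg_per_liter = (alpha_acid_pct / 100.0) * hop_ounces * 7490.0 / batch_gallons
    return tinseth_utilization(boil_gravity, boil_minutes) * mg_per_liter


def strike_water_temp(quarts_per_pound, grain_temp_f, target_mash_temp_f):
    """Water temperature (F) needed to hit the target mash temperature.

    Assumes the vessel itself absorbs no heat, which is optimistic.
    """
    return (0.2 / quarts_per_pound) * (target_mash_temp_f - grain_temp_f) \
        + target_mash_temp_f


if __name__ == "__main__":
    print(round(estimate_ibu(6.4, 1.0, 5.0, 1.050, 60), 1))
    print(round(strike_water_temp(1.5, 70.0, 152.0), 1))
```

A web version would mostly wrap functions like these in forms. The hard part is collecting the per-system numbers (vessel losses, extraction efficiency) that a sketch like this leaves out.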
Another tool I’d love to see is one that would let me manage my consulting business online. If BestPractical’s RT had a good PayPal plugin that would let you charge per ticket or charge for a bundle of so many tickets or something, that’d be a good start, but I’ve mucked with the code for RT (it’s written mostly in Perl), and it wasn’t a pleasant experience. This wouldn’t be a complete solution either, because most of my work is *not* simple support tickets, it’s large projects. For those I’d like people to be able to pay invoices online. There’s lots more I’d like to add on top of that, but that’s the general gist of it, and in the past I’ve been unable to find a really good solution, where “really good” is a completely nebulous term barely defined in my own head.
In addition to those ideas, I registered a couple of domains over the past year, and I hope to do some cool things with them as well if I ever get some time away from work and consulting. Oh yeah - I’ll also continue working on loghetti! Keep an eye out for updates. Maybe some people reading this have similar interests and would like to collaborate. Ciao for now!
A non-degree-holder’s view of hiring decisions
I get a good number of job offers without sending resumes around. I guess my name shows up in enough places, associated with enough buzzwords, that recruiters fire off emails first and read the fine print later. The “fine print” in my case, says that I do not have a college degree.
99.999% of the time, recruiters, and even hiring managers, tell me that my experience more than makes up for any lack of a formal education (one manager said he had seen many less capable MS degree holders). However, there are a few little quirks I’ve found at some larger companies. Mainly, they fall into two categories:
- They just plain don’t hire anyone without a degree
- You can’t get past a certain “tier” of employment without a degree
I’ve worked in business. I grew up in family businesses. I understand that, in certain circumstances, corporations can have legitimate reasons for these stances. Probably the only one I’ve ever actually heard myself that seemed almost reasonable is “insurance”. Some positions in some companies can have a drastic effect on things that directly affect the bottom line of that company, and if the company has insurance to protect them against extremely costly one-time errors (like E&O insurance), the insurance company might give them a better rate if they take steps to decrease the likelihood of such errors… like requiring that employees in these positions have a degree. I think it’s kind of a twisted logic, really. Instead of developing processes and procedures to reduce the likelihood of a problem, they think that hiring someone with a degree by itself will help the issue. Like degree-holders are less prone to errors due to the simple human condition. Odd, that.
Oh, and there’s a third quirk, but not with the corporate policies - with the hiring managers themselves. The quirk is that certain hiring managers, without regard for stated policy, won’t hire someone who doesn’t have a degree, presumably because they fear they might be fired for hiring someone who fails to produce because they don’t have a degree. The other possible reasoning here is that they have the attitude that “I went through it, so why should I give someone a job who hasn’t?”
The *real* problem with these hiring managers, and with corporations who have (non-insurance-related) strict educational requirements of applicants, is that there’s a shortcoming in the business education curricula: they don’t teach the future middle managers of the world how to evaluate an applicant who doesn’t have a traditional, formal education.
This is a guess, of course, since I haven’t been to business school. But aren’t managers unwilling to hire those without formal educations also guessing? I would submit that they are. It’s the same kind of guess, too. It’s a guess based in part (maybe) on experience, and in part based on stereotypes or other preconceptions.
My experience with those who don’t, or won’t, hire non-degree holders is that they think of degree-holders as “more well-rounded”. Assuming the non-degree holder hasn’t resigned themselves to a life of flipping burgers, I don’t think this could be further from the truth. It is, in fact, an old wives’ tale with no basis in fact. We were all told as kids that college would make us more “well-rounded”, and so we all worked to attain this nebulous goal. In reality, a college degree, by itself, is simply not any kind of valuable indicator of “well-roundedness”. Colleges are businesses. They produce college graduates. They do it efficiently, with an eye toward the business end of things more than anything else. If a college graduate is well-rounded, it is as much in spite of their college experience as because of it. Most well-rounded people are probably predisposed to being well-rounded, and had a tendency toward things to help them become well-rounded by the time they arrived on campus.
Besides this somewhat lame view of non-degree holders, another assumption is that non-degree holders do not have *any* education, and so *cannot* be prepared to perform the tasks that a graduate can (allegedly) perform. This argument might hold water with me if I didn’t have some idea already how resumes are typically handled by HR departments. The short story there is that there are tons of resumes that a hiring manager never sees because they’re pre-qualified (read: filtered) on the basis of educational status.
My area of expertise is technology. I don’t have a degree. It would therefore be assumed by many a hiring manager that I have no idea what Big-O notation is, don’t know anything about object delegation or polymorphism, and can’t analyze problems the same way as a college grad. The manager would be wrong on the first two counts, because while I didn’t study in college, I *did* study. But what about that third bit about analyzing problems?
I can tell you that it’s absolutely true that I do not analyze problems the same way as a college grad. What’s a real shame, though, is that a lot of managers would assume that “not the same” means “not as well”. There’s no justification for this assumption. In fact, I would argue that it *has* been the case in the past that having one rogue non-degree holder in a room full of grads can help to avoid “group think”, and help the group turn a problem sideways for another look. It is unfortunate that a degree that is supposed to help people “think outside the box” seems to put everyone in the same exact spot outside of that box, looking at it from the same exact perspective, coming to the same exact conclusion.
Finally, there is a certain class of degree-holder that I think is never a win over hiring a young, hungry rogue like myself. This class of graduate has hung their degree on the wall and decided that they no longer have any obligation to continue to keep up with new developments in their field. They code the same way they’ve always coded, use the same collection of old trusty tools, deal with technology the same way they’ve always dealt with technology, and have stood more or less completely still, failing to seek out (much less embrace) new tools, techniques, languages, and paradigms for getting things done. How can you possibly think outside the box when your vision of the box is 10 years old and assumes that the box is completely static?
I believe it was Nietzsche (sp?) who wrote that truth is not static (of course, I’m paraphrasing, and I might be thinking of James). If you can see yourself subscribing to that idea at all (it seems counterintuitive at first glance, but deeper thought will probably get you there), then how can a person with a notion of “truth” that is tied to their college experience be any better at figuring out what to do with it than someone who doesn’t have a degree, but is forever seeking out interesting things that come out of an ever-evolving truth?
Anyway, that’s my diatribe for the evening. If you’re a hiring manager with preconceived notions about college degree holders (or not) that come from decades of brain-hammering by graybeards, then cling to that safety blanket all you want, but know that it’s old thinking. Learn to be (gasp!) creative about how you evaluate applicants, and how you build your teams, and how you execute on your visions. Try to find the other box. The one that doesn’t look anything like it did in college.
I’m interested in hearing feedback on these ideas. I’m sure some will take offense. I don’t mean any. I’m certainly not saying that not having a degree is better, or that degree-holders are all complacent or anything like that. I *am* saying that *formal* education *can* be an irrelevant point of comparison, and that relying solely on the existence (or not) of a *formal* education as the basis for hiring one applicant over another is ludicrous.
Also, my blog is subscribed to by various sites, and I decided to publish this to all of the categories, because I think it *could* be interesting to pretty much anyone. If this is spam in your eyes, let me know. If you find a lively discussion about this going on anywhere, I’d be really interested in that as well.
Amazon Adds Static IP and “Availability Zones” to EC2
This is cool. You can now associate a static IP address with your EC2 instances. No more mucking about with 10-minute DNS timeouts or dynamic DNS routines. You can also elect to start certain instances in multiple locations using “Availability Zones”.
These new features will make it a little easier for people to deploy larger web sites and services without quite as much management overhead. There are also some rumblings in the forums that Amazon is actually working on immutable storage for EC2 images, which would pretty much complete the puzzle for most. A good bit of the custom scripts and routines people come up with for running on EC2 exist to get around this “limitation”, although, truth be told, a good part of dealing with that is having a good backup plan, which you really should have anyway - EC2 just forces the issue.
Hadoop, EC2, S3, and me
I’m playing with a lot of rather large data sets. I’ve just been informed recently that these data sets are child’s play, because I’ve only been exposed to the outermost layer of the onion. The amount of data I *will* have access to (a nice way of saying “I’ll be required to wrangle and munge”) is many times bigger. Someone read an article about how easy it is to get Hadoop up and running on Amazon’s EC2 service, and next thing you know, there’s an email saying “hey, we can move this data to S3, access it from EC2, run it through that cool Python code you’ve been working with, and distribute the processing through Hadoop! Yay! And it looks pretty straightforward! Get on that!”
Oh joyous day.
I’d like to ask that people who find success with Hadoop+EC2+S3 stop writing documentation that makes this procedure appear to be “straightforward”. It’s not.
One thing that *is* cool, for Python programmers, is that you actually don’t have to write Java to use Hadoop. You can write your map and reduce code in Python and use it just fine.
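For readers who have not seen it, the usual way to do this is Hadoop’s streaming mode, where the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. Below is a minimal sketch that counts HTTP status codes in Apache access logs. The function names and the regular expression are illustrative assumptions and are not taken from any code mentioned in this post.

```python
import re
import sys
from itertools import groupby

# Assumes the status code directly follows the closing quote of the
# request, as in Apache's common and combined log formats.
STATUS_RE = re.compile(r'" (\d{3}) ')


def map_lines(lines):
    """Mapper: emit 'status<TAB>1' for every line with a parsable status."""
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            yield f"{match.group(1)}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key.

    Relies on the input being sorted by key, which the framework does
    between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run(role, stream_in=sys.stdin, stream_out=sys.stdout):
    """Run one side of the job; role is 'map' or 'reduce'."""
    step = map_lines if role == "map" else reduce_lines
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

A small wrapper script that calls `run("map")` or `run("reduce")` is what you would hand to the streaming job as its mapper and reducer. Because both sides are plain functions over lines, you can test the logic locally by piping a log file through the mapper, `sort`, and the reducer before going anywhere near a cluster.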
I’m not blaming Hadoop or EC2 really, because after a full day of banging my head on this I’m still not quite sure which one is at fault. I *did* read a forum post where someone had a similar problem to the one I wound up with, and it turned out to be a bug in Amazon’s SOAP API, which is used by the Amazon EC2 command line tools. So things just don’t work when that happens. Tip 1: if you have an issue, don’t assume you’re not getting something. Bugs appear to be fairly common.
Ok, so tonight I decided “I’ll just skip the whole hadoop thing, and let’s see how loghetti runs on some bigger iron than my macbook pro”. I moved a test log to S3, fired up an EC2 instance, ssh’d right in, and there I am… no data in sight, and no obvious way to get at it. This surprised me, because I thought that S3 and EC2 were much more closely related. After all, Amazon Machine Images (used to fire up said instance) are stored on S3. So where’s my “s3-copy” command? Or better yet, why can’t I just *mount* an s3 volume without having to install a bunch of stuff?
This goes down as one of the most frustrating things I’ve ever had to set up. It kinda reminds me of the time I had to set up a beowulf cluster of about 85 nodes using donated, out-of-warranty PC hardware. I spent what seemed like months just trying to get the thing to boot. Once I got over the hump, it ran like a top, but it was a non-trivial hump.
As of now, it looks like I’ll probably need to actually install my own image. A good number of the available public images are older versions of Linux distros for some reason. Maybe people have orphaned them and gone to greener pastures. Maybe they’re in production and haven’t seen a need to change them in any way. I’ll be registering a clean install image with the stuff I need and trudging onward.
The Power of Open Source
I think my very favorite aspect of the open source development model is that it allows me to practice the philosophies I use in my every day personal life, and apply them to software development as well. In my teens and early 20’s I read quite a lot of Aristotle and Plato, and a very major philosophy that I took away from all of that reading is “be conscious of your own ignorance”. And so I am.
There are just about a million reasons to start an open source project. In the case of loghetti, I made it a project because I know that there are things that other people know, which I do not know, but would probably like to know or benefit from knowing (we’ll not go into epistemological discussions - I’m just going to use the word “know” in the traditional sense here).
Turns out, just knowing that there’s stuff out there that I don’t know has proven useful. Within hours of launching the Google Code site for the project, Kent Johnson joined the project, changed maybe 5 lines of code in the apachelogs.py module, and according to my testing, that change resulted in a 6x speed increase. If you’re using loghetti from the SVN trunk, it’s gone from being sluggish for anything over 50MB, to being pretty darn quick even up to 250MB, at least for simple queries like --code=404 (which is what I do speed comparisons with). The changes will be in a tarball probably some time next week, for those who don’t want to use svn.
We haven’t even touched threading yet.
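For anyone curious what a “simple query” like filtering on a status code involves, here is a rough, self-contained sketch of the general idea. It is not loghetti’s code and says nothing about how apachelogs.py works internally. The regular expression, field names, and `filter_log` function are assumptions made up for this example, and it only handles well-formed common or combined format lines.

```python
import re

# Hypothetical pattern for the leading fields of an Apache access log line.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<code>\d{3}) (?P<size>\S+)'
)


def filter_log(lines, **criteria):
    """Lazily yield the lines whose parsed fields match every criterion.

    Lines that do not parse are skipped. Values are compared as strings,
    so code=404 and code="404" behave the same.
    """
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue
        fields = match.groupdict()
        if all(fields.get(name) == str(value)
               for name, value in criteria.items()):
            yield line
```

Since it is a generator, something like `for hit in filter_log(open("access.log"), code=404): ...` never holds the whole file in memory. Memory use stays flat on a 250MB log, and the time spent is dominated by the per-line parsing.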
Loghetti is now an open source project
I was getting feedback about loghetti, and it was all very useful, and it’s still coming in, and I can’t work full-time on it. At the same time, I’d love for some of the stuff I’ve read about to be implemented, because I certainly could make use of it myself.
So if anyone is interested, you can get loghetti, get more info about loghetti (it’s an apache log filter written in Python), or join the project here.