
Wednesday, February 10th, 2010

Regular Expressions are Anything But Regular

November 15, 2009 by Jason Bean  
Filed under Computers

I’ve spent a bit of time over the last couple of days trying to figure out exactly how regular expressions work. I’ve got to use one in an application I’m working on, and I’m just not getting the syntax at all.

[Image: regex-illustration, showing the regular expression that is reproduced as text further down]

It may look like gibberish, but the text above is the regular expression I’m using to solve my earlier problem. Doesn’t look very "regular", does it? Although it uses characters I’m familiar with, their order and meaning might as well be hieroglyphics to me. So, what is a "regular expression"?

The Wiktionary website defines it as follows:

A concise description of a regular formal language with notations for concatenation, alternation, and iteration (repetition) of subexpressions.

My task for my application was to prevent .zip files from being uploaded to the server. I never found a regular expression that excluded .zip files, but I did find the one above, which instead allows a fixed list of other file types.

Here’s the expression above again:

^.+\.(([jJ][pP][eE]?[gG])|([gG][iI][fF])|([pP][dD][fF])|([dD][oO][cC])|([dD][oO][cC][xX])|([bB][mM][pP])|([tT][xX][tT]))$

The "regular expression", or "regex", above matches any filename ending in .jpg, .jpeg, .gif, .pdf, .bmp, .doc, .docx, or .txt and allows that file to be uploaded. Each bracketed pair, such as [jJ], holds the upper- and lower-case versions of one letter, so the extension matches however it is capitalized.

I’m still not really sure what all the other symbols in there are specifying. I’ve got more to learn for sure.
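For anyone in the same spot, here is a minimal sketch of what the remaining symbols do, with the pattern copied unchanged from above. Python is an assumption on my part (the post doesn't name the application's language), and `is_allowed` is a hypothetical helper name used only for illustration.

```python
import re

# The allow-list pattern from the post, unchanged, split over two string
# literals only to keep the line short.
ALLOWED = re.compile(
    r"^.+\.(([jJ][pP][eE]?[gG])|([gG][iI][fF])|([pP][dD][fF])|([dD][oO][cC])"
    r"|([dD][oO][cC][xX])|([bB][mM][pP])|([tT][xX][tT]))$"
)

# Reading the pattern left to right:
#   ^       anchor: the match must begin at the start of the filename
#   .+      one or more of any character (the base name)
#   \.      a literal dot; unescaped, "." would mean "any character"
#   [jJ]    a character class: exactly one character, either j or J
#   [eE]?   "?" makes the preceding item optional, so jpg and jpeg both match
#   (a)|(b) groups joined by "|" are alternatives: a or b
#   $       anchor: the extension must sit at the end of the filename


def is_allowed(filename: str) -> bool:
    """Return True if the filename ends in one of the allowed extensions."""
    return ALLOWED.match(filename) is not None


for name in ["photo.JPG", "scan.jpeg", "notes.txt", "archive.zip", "report.pdf.zip"]:
    print(name, is_allowed(name))
```

Because of the `$` anchor, a name like report.pdf.zip is rejected even though it contains an allowed extension in the middle.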


Comments

4 Responses to “Regular Expressions are Anything But Regular”
  1. Graeme says:

    Most programs will let you match case-insensitively, so you could shorten it to something like /.+\.(jpg|jpeg|gif|pdf|doc|docx|bmp|txt)$/i

    Even with egrep you can use the -i flag to turn on ignore case.

    I know you’ve got extra bits in there for jpeg/jpg and doc/docx but that should help you along.

    Again, depending on the software you can break the regex out across multiple lines and comment each line so you know what’s going on if you need to make more complex ones.

  2. Eric Martindale (subscribed) says:

    What language are you writing this in? There’s a big chance you can easily slim this down by using a case-insensitive option.

    Also, if you want to exclude zip files, checking the last few characters of the filename is not an accurate method. What you actually want to do is read the contents of the file and see if it’s a zip file (or other archive, if necessary). Otherwise, people will be able to subvert your method quite easily by simply renaming the file. That said, embarking on a coding journey of this magnitude may be more trouble than it’s worth.

    Be wary of techniques that allow users to hide files inside of images.

  3. Jason Bean says:

    Thanks for the input, guys. I’ll see if I can work those modifications into my app. Basically, I’m just trying to allow attachments that can be converted into an image file for automatic faxing when an email address doesn’t exist for a user. Thanks for the help.
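The suggestions in the comments above (a case-insensitive flag, a commented multi-line pattern, and checking file contents rather than the name) can be sketched together roughly as follows. This is only an illustration under stated assumptions: Python is assumed because the application's language is never given, and `has_allowed_extension`, `looks_like_zip`, and `accept_upload` are hypothetical names. The content check relies on the standard library's `zipfile.is_zipfile`.

```python
import io
import re
import zipfile

# Graeme's shorter pattern, written in verbose mode so each piece can carry
# its own comment. IGNORECASE replaces the [jJ][pP]... pairs.
ALLOWED = re.compile(
    r"""
    .+                                    # base name: one or more characters
    \.                                    # a literal dot
    (jpg|jpeg|gif|pdf|doc|docx|bmp|txt)   # any one allowed extension
    $                                     # and nothing after it
    """,
    re.IGNORECASE | re.VERBOSE,
)


def has_allowed_extension(filename: str) -> bool:
    """Name-based check only; a renamed file will pass this."""
    return ALLOWED.search(filename) is not None


def looks_like_zip(data: bytes) -> bool:
    """Content-based check: inspect the bytes instead of trusting the name."""
    return zipfile.is_zipfile(io.BytesIO(data))


def accept_upload(filename: str, data: bytes) -> bool:
    """Hypothetical policy combining both checks."""
    if not has_allowed_extension(filename):
        return False
    # .docx files are themselves ZIP containers, so the content check is
    # skipped for them here; a zip renamed to .docx would still get through.
    if filename.lower().endswith(".docx"):
        return True
    return not looks_like_zip(data)
```

A zip archive renamed to something like renamed.jpg passes the name check but fails the content check, which is the gap Eric describes. The .docx exception shows why a blanket "reject anything that is a zip" rule needs care.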

Trackbacks

Check out what others are saying about this post...
  1. [...] my recent explorations into the world of writing "regular expressions" I came across a couple of different applications that are supposed to make it easier for you to [...]







All content is Copyright © 2005-2010 b5media. All rights reserved.