Sometimes you throw around a programming buzzword for a long time without knowing what it really means. You kind of know what it means on the surface, and you kind of know what it does, but you don’t have a deep understanding of how it works or why it’s useful. I have found that once you dig a little deeper into something you don’t fully understand, you start to see patterns in things.
Sometimes you will have an aha moment of clarity, like wearing a snorkel mask underwater for the first time. You feel empowered, and best of all, you can now completely bore an entire group of friends at parties while looking like an intelligent know-it-all. Okay, maybe that’s not the best part, but considering computers seem like magic boxes to most people, it can be rewarding to show someone else that computers are just really dumb adding machines made by really smart (lazy?) people to do really repetitive things really, really fast. Yes, really!
What is Base 64 Encoding (Does it Involve Math)?
The short answer is that Base64 can be a code obfuscator, a potential code envelope, or a digit shrinker. The good news is that although it does involve math, there are plenty of online tools to do all the repetitive boring stuff (what computers are good at). First, let’s look at what is so magical about 64.
What’s so magical about the number 64?
Nothing. Nothing is special about 64. You can do Base26 encoding, Base52 encoding, or even Base2 encoding (1s and 0s). The base number is just the number of unique symbols that you can use. 64 is a pretty efficient number as it’s a power of 2, and base 2 is what computers do all their calculations in. Technically, you could choose any 64 characters, but the standard character set for Base64 is:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
The reason these characters were chosen is probably obvious, but in case you didn’t realize, they’re all human-readable characters. There are no funny characters like [space], so they can be printed to the screen, and there aren’t any special characters like @^%$* to cause problems. Let’s start with an example and dissect it.
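Going back to the character set for a moment: if you’d rather see the 64 symbols built in code than typed out, here is a minimal Python sketch. The name `BASE64_ALPHABET` is my own choice for illustration, and the order assumed here is the standard one (uppercase, lowercase, digits, then `+` and `/`).

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
# BASE64_ALPHABET is a name made up for this sketch.
BASE64_ALPHABET = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
)

print(len(BASE64_ALPHABET))        # 64
print(BASE64_ALPHABET[0])          # A
print(BASE64_ALPHABET.index("Q"))  # 16
```

The position of each character in this string is the number it stands for, which is all the "lookup table" really is.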
How does Base 64 encoding work?
Now that we kind of understand what Base 64 is, can we figure it out? Is it some esoteric computer-y thing that we can’t do by hand? To answer simply… sort of, but not really. Understanding how it works does involve a bit of math, but I’ll try to explain it in detail.
Encoding a Simple Example:
Let’s say we have a piece of text that we want to encode into Base64. We might want to obfuscate some code so it’s hard to read and mess with. Maybe we want to take some binary data and put it into a URL. We’ll discuss why you’d do that later, but for now, let’s just encode and decode. Our sample that we want to encode into Base64 is the word BEEF. It could be anything, really.
Using an online Base64 encoder is usually the easiest way, although pretty much ALL major languages have this ability built in. You can search the web, but the one I’ll use is base64encoder.org. I went to the encoder tab and the result is:
QkVFRg==
You can ignore the == at the end as it’s output filler. But what exactly happened? How did we go from BEEF to QkVFRg== and can we do this manually?
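As an aside, if you’d rather use a built-in than an online tool, here is a small Python sketch using the standard library’s `base64` module. The variable names are just for illustration.

```python
import base64

# b64encode works on bytes, so the text is turned into bytes first.
encoded = base64.b64encode("BEEF".encode("ascii"))
print(encoded.decode("ascii"))  # QkVFRg==

# Decoding reverses the process.
decoded = base64.b64decode(encoded)
print(decoded.decode("ascii"))  # BEEF
```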
How to Manually Encode BEEF?
The first thing we need to do is break down our word BEEF into binary. The letter B in UTF-8 is represented by the number 66, the same value it has in the ASCII table. I won’t go into why, but here’s a link to the ASCII table if you’re curious. Once we convert 66 to binary (base 2) it becomes 0100 0010. Let’s convert the whole word:
|Letter|B|E|E|F|
|Decimal|66|69|69|70|
|Binary|0100 0010|0100 0101|0100 0101|0100 0110|
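If you want to check the conversion yourself, here is a short Python sketch of the same letter-to-number-to-binary step. The variable names are mine, and it assumes plain ASCII letters, where one character is one byte.

```python
word = "BEEF"

# Print each letter with its character code and 8-bit binary form.
for letter in word:
    code = ord(letter)
    print(letter, code, format(code, "08b"))
# B 66 01000010
# E 69 01000101
# E 69 01000101
# F 70 01000110

# Join the 8-bit pieces into one long string of bits.
bit_string = "".join(format(byte, "08b") for byte in word.encode("utf-8"))
print(bit_string)  # 01000010010001010100010101000110
```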
So, BEEF in binary is essentially a string of 1s and 0s like so: 01000010010001010100010101000110. To Base64 encode this number, we need to understand the binary number. Each of the letters above, when encoded into binary, is 8 characters long. It’s an 8-bit sequence, as there are 8 pieces of information, also known (conveniently) as 1 byte. If you are unfamiliar with how binary and other number systems work, I am writing an article about why developers should care about binary, but if it’s not published yet, you can just search the internet. 1 byte has 256 different possibilities (called permutations) from 0000 0000 to 1111 1111. Don’t believe me? Feel free to count them all… or if you want to use math, each spot has 2 possible values, 0 or 1, and there are 8 spots: 2^8 (2×2×2×2×2×2×2×2) = 256. Base64 encoding a binary number is dead simple. We just split the whole value into groups of 6 instead of 8. Why? We’ll get to that in a minute, but first, let’s split up our binary number into groups of 6:
01000010 01000101 01000101 01000110
010000 100100 010101 000101 010001 10xxxx
You probably noticed that we can’t evenly split our binary number into groups of six, so we need to pad that last value at the end. In case you’re wondering, that’s what the == means at the end of our Base 64 encoded string. Each = sign represents two binary digits of padding, so a 3-letter word (24 bits, which is exactly four groups of 6) wouldn’t need any padding.
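To see the padding rule in action, here is a quick Python sketch that encodes words of different lengths with the built-in `base64` module. The word list and the `results` dictionary are just examples I picked.

```python
import base64

# Encode a 3-, 4-, and 5-letter word and compare the padding.
results = {}
for text in ["BEE", "BEEF", "BEEFY"]:
    results[text] = base64.b64encode(text.encode("ascii")).decode("ascii")
    print(text, "->", results[text])
# BEE -> QkVF
# BEEF -> QkVFRg==
# BEEFY -> QkVFRlk=
```

Three letters need no padding, four letters need two = signs, and five letters need one.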
So, why do we group the number into 6 instead of 8? Well, if you know binary, you probably already guessed this, but 6 bits can represent exactly 64 different values (0 to 63)! Let’s convert our binary numbers into decimal. We’ll replace the x’s with 0s.
010000 100100 010101 000101 010001 100000
16 36 21 5 17 32
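The grouping and the conversion to decimal can be sketched in a few lines of Python. This is only an illustration of the steps described here; the variable names are my own.

```python
bit_string = "01000010010001010100010101000110"  # BEEF as 32 bits

# Pad with 0s on the right until the length is a multiple of 6.
padding_bits = (-len(bit_string)) % 6
padded = bit_string + "0" * padding_bits

# Cut the padded string into 6-bit groups, then read each group as a number.
groups = [padded[i:i + 6] for i in range(0, len(padded), 6)]
values = [int(group, 2) for group in groups]

print(padding_bits)  # 4
print(groups)        # ['010000', '100100', '010101', '000101', '010001', '100000']
print(values)        # [16, 36, 21, 5, 17, 32]
```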
If we consult our Base64 list of valid values, we can see it starts at A–Z which map to 0–25, then a–z which map to 26–51, and then 0–9+/ which map to the rest, 52–63. So… 16 would be Q, which is exactly what our encoder returned as the first value! Success!
If we manually looked up each position in our Orphan Annie Base 64 secret decoder table we’d find they map to the following letters:
16 36 21 5 17 32
Q k V F R g
And there you have it! You’ve successfully (and very painfully) encoded a string into Base 64. Pat yourself on the back and treat yourself.
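If you’d like the computer to do the painful part, here is a rough Python sketch that follows the same manual steps: bytes to bits, bits to groups of 6, groups to letters, then = padding. `manual_base64` and `ALPHABET` are names I made up for this example, and it is meant as a teaching aid, not a replacement for the built-in encoder.

```python
import base64
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, "+", "/"
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def manual_base64(data: bytes) -> str:
    """Encode bytes to Base64 by hand (hypothetical helper for illustration)."""
    # 1. Turn every byte into 8 bits and join them.
    bits = "".join(format(byte, "08b") for byte in data)
    # 2. Pad with 0s so the bit count is a multiple of 6.
    bits += "0" * ((-len(bits)) % 6)
    # 3. Look up each 6-bit group in the alphabet.
    encoded = "".join(
        ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)
    )
    # 4. Pad the output with "=" to a multiple of 4 characters.
    return encoded + "=" * ((-len(encoded)) % 4)


print(manual_base64(b"BEEF"))  # QkVFRg==
# Compare against the standard library as a sanity check.
print(manual_base64(b"BEEF") == base64.b64encode(b"BEEF").decode("ascii"))  # True
```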
Why Base 64 encode?
So, what’s the point of encoding in Base 64? Well, in our introduction, we mentioned that Base64 can be a code obfuscator, a code envelope, or a digit shrinker. We’ll briefly go over these one by one.
Obfuscation is pretty straightforward. If you want to make something hard to read, you encode it. Note that this is NOT cryptography in any way, shape, or form. DO NOT THINK that you’re making something secret by encoding it. To a computer, or even to most developers and I would guess almost every hacker, changing something to Base 64 (or pretty much any other base) is like trying to hide “John Smith” by writing it as “htimS nhoJ”. It’s like putting a blanket over yourself and standing in the middle of a room during a game of hide-and-seek. It’s like locking a door while leaving the key hanging off the door knob. It’s elementary to decode and pretty easy to recognize. Even if you chose a different base (like base 58), it’s not that difficult to try a few decoders.
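To show just how little hiding is going on, here is a small Python sketch using the built-in `base64` module. There is no key or password anywhere: anyone who sees the encoded text can reverse it in one line. The variable names are mine.

```python
import base64

# "Hide" a name by encoding it.
secret = base64.b64encode(b"John Smith").decode("ascii")
print(secret)  # Sm9obiBTbWl0aA==

# Anyone can undo it without knowing anything extra.
revealed = base64.b64decode(secret).decode("ascii")
print(revealed)  # John Smith
```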
However, if you’re trying to make something difficult to guess on first glance, it’s not a terrible solution. Of course, sometimes, you don’t want people to be able to read some code because you’re trying to make it look esoteric and complicated, or you’re trying to hide the code’s intent. This is where encoding can turn some code into a code envelope.
The Code Envelope… Malicious Letter Bomb, or Innocent Transport?
Sometimes you want to take some information and transport it differently. Maybe you don’t want people to tamper with the code (code obfuscation), but sometimes you don’t want people to understand what was written because it’s malicious code. PHP, for example, has a keyword called eval. The purpose of eval is to take some text and run it as PHP code. As you can imagine, this is a favorite technique of hackers.
For example, you could take the following PHP code and encode it:
echo "BOO!";
Now you’d just need to do the following:
eval(base64_decode("ZWNobyAiQk9PISI7"));
to run that code. Pretty neat trick. However, if red flags aren’t going off in your head, they should be. Any code that you can’t read… could be anything at all. It’s one of the ways hackers try to hide code. I recently saw some malicious code that was encoded 5 times over in order to hide the actual code. I write about that in another article about getting hacked and how to identify, search and destroy the malicious code. Unless you have a good reason (maybe to store the code in a database with spaces, tabs, and other characters intact), I probably wouldn’t use eval to put your code in an envelope. However, one valid and commonly used reason for encoding is as a digit shrinker.
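If you ever find an encoded string like this in code you don’t trust, you can look inside the envelope without running what’s in it. Here is a rough sketch in Python (my choice of language for illustration, not PHP; the variable names are made up). It only decodes and prints the text, it never executes it. The second half wraps the same code 5 times and peels it again, assuming you already know how many layers there are.

```python
import base64

# Decode the payload to READ it. Nothing is executed here.
payload = "ZWNobyAiQk9PISI7"
decoded = base64.b64decode(payload).decode("utf-8")
print(decoded)  # echo "BOO!";

# Wrapping the code 5 times over, as a stand-in for layered obfuscation...
layered = decoded.encode("utf-8")
for _ in range(5):
    layered = base64.b64encode(layered)

# ...is undone by simply decoding 5 times.
peeled = layered
for _ in range(5):
    peeled = base64.b64decode(peeled)
print(peeled.decode("utf-8"))  # echo "BOO!";
```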
Most databases use integers as primary keys, but the problem with digits is there are only 10 different symbols to choose from. Yes, the decimal system is a base 10 system. Well, base 12 if you include symbols like . and - to be precise. But databases don’t use the minus sign and decimal point for IDs, so we have 10 different symbols. 64 is larger than 10, so we have more options! In actual fact, link shorteners often use something like base 52 (uppercase and lowercase letters), or base 36 (lowercase letters and numbers). With 6 positions, you can represent 1,000,000 permutations (including 0) if you use only numbers (10^6). With base 52, you could represent 19,770,609,664! In fact, you could get 7,311,616 different combinations with just 4 characters!
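Here is a minimal Python sketch of the digit-shrinker idea, assuming a base 52 alphabet of uppercase then lowercase letters. The function names `shrink` and `expand` and the alphabet order are my own choices for illustration; a real link shortener may do this differently.

```python
import string

# Assumed base 52 alphabet: A-Z followed by a-z.
BASE52 = string.ascii_uppercase + string.ascii_lowercase


def shrink(number: int, alphabet: str = BASE52) -> str:
    """Write a non-negative integer using the symbols in alphabet."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return alphabet[0]
    base = len(alphabet)
    symbols = []
    while number > 0:
        number, remainder = divmod(number, base)
        symbols.append(alphabet[remainder])
    return "".join(reversed(symbols))


def expand(text: str, alphabet: str = BASE52) -> int:
    """Turn a shrunk string back into the original integer."""
    number = 0
    for symbol in text:
        number = number * len(alphabet) + alphabet.index(symbol)
    return number


print(10 ** 6)  # 1000000
print(52 ** 6)  # 19770609664
print(52 ** 4)  # 7311616

short_id = shrink(999_999)          # a 6-digit ID...
print(short_id, len(short_id))      # ...fits in 4 letters
print(expand(short_id) == 999_999)  # True
```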
Learning how Base 64 works, and why it works, can help you understand what’s actually going on under the hood of your code at times. While you may think that it’s esoteric and not useful, understanding the core technology will often help you make better decisions, as well as identify code smells and potentially dangerous code. You may not need to know and understand how to do the nitty-gritty details from memory, but you should definitely care about it.