Research by: Eyal Itkin
Everyone is familiar with the concept of IoT, the Internet of Things, but how many have heard of smart lightbulbs? You can control the light in your house, and even calibrate the color of each lightbulb, just by using a mobile app or your digital home assistant. The smart lightbulb management is done over WiFi or even ZigBee, a low bandwidth radio protocol.
A few years ago, a team of academic researchers showed how they can take over and control smart lightbulbs, and how this in turn allows them to create a chain reaction that can spread throughout a modern city. Their research brought up an interesting question: aside from triggering a blackout (and maybe a few epilepsy seizures), could these lightbulbs pose a serious risk to our network security? Could attackers somehow bridge the gap between the physical IoT network (the lightbulbs) and even more appealing targets, such as the computer network in our homes, offices or even our smart cities?
We’re here to tell you the answer is: Yes.
Continuing from where the previous research left off, we go right to the core: the smart hub that acts as a bridge between the IP network and the ZigBee network. By masquerading as a legitimate ZigBee lightbulb, we were able to exploit vulnerabilities we found in the bridge, which enabled us to infiltrate the lucrative IP network using a remote over-the-air ZigBee exploit.
Below is a Video Demonstration of this attack:
This research was done with the help of the Check Point Institute for Information Security (CPIIS) in Tel Aviv University.
After we finished our previous research (Say Cheese: How I Ransomwared Your DSLR Camera) we decided to extend our debugger (Scout) to support additional architectures such as MIPS. As the best way to do so is to start researching MIPS, I asked on Twitter for suggestions of a good MIPS target for a vulnerability research.
As is mostly the case, people responded with a few promising leads, and the most promising one was from an old colleague of mine: Eyal Ronen (@eyalr0), who is now in a research position at the CPIIS (Small world, isn’t it?). Eyal Ronen suggested I continue his research on smart lightbulbs (See “Prior Work” in the next section). In their original research, his group was only able to take control of the lightbulbs themselves. He believed it might be possible to leverage this position in the ZigBee network to deploy an attack against the bridge that connects the ZigBee network to the IP network. In essence, this new attack vector enables an attacker to infiltrate the IP network from the ZigBee network, using an over-the-air attack.
In IoT Goes Nuclear: Creating a ZigBee Chain Reaction, a team of researchers led by Eyal Ronen (@eyalr0), Colin O’Flynn (@colinoflynn) and Adi Shamir, analyzed the security aspects of ZigBee smart lightbulbs. More specifically, they focused on the Philips Hue bridge and lightbulbs, showing a series of exploits:
By combining these 3 demonstrated attacks, the researchers argued that by taking control of a chosen subset of lightbulbs in a smart city, they could trigger a nuclear-like chain reaction that could eventually take control of all the lightbulbs in the city.
Due to the nature of the attacks, the vendor was only able to block the second attack, thus leaving us with the capabilities to:
After receiving a detailed explanation of their original research, and armed with a Philips Hue Bridge that Eyal R. managed to salvage from their lab, we were ready to begin this promising new research.
According to Wikipedia, “ZigBee is an IEEE 802.15.4-based specification for a suite of high-level communication protocols used to create … low-power, low data rate, and close proximity wireless ad hoc networks.” Not to be confused with IEEE 802.11 (WiFi), according to the OSI model, IEEE 802.15.4 is the technical standard for the radio-based network protocol which acts as layers 1-2 of the ZigBee network stack.
Just to get a sense of this low data-rate protocol, the maximal transmission unit (MTU) for a frame in the underlying MAC layer of IEEE 802.15.4 is 127 bytes. This means that unless fragmentation is used, the messages of the ZigBee network stack are very limited in size. Hopefully, this limitation won’t restrict us too much in finding, and later on exploiting, vulnerabilities in the ZigBee implementation.
On top of the narrow radio network layer, ZigBee defines a full stack of network layers, as can be seen in this figure taken from (an older version of) the ZigBee specs:
Figure 1: ZigBee network stack outline.
In short, we can roughly divide the network stack into 4 layers (in ascending order):
ZDP = ZigBee Device Profile
ZCL = ZigBee Cluster Library
For those of you who are familiar with the SNMP protocol, ZCL looks like a different encoding of the same logical interface. The ZCL layer allows devices to query (READ_ATTRIBUTE
) and set (WRITE_ATTRIBUTE
) a collection of configuration values (clusters), which ultimately allows the operator (the bridge) to control the lightbulbs. For example, attributes for the Color Control cluster include:
This example also shows that these are not ordinary white/yellow lightbulbs. These smart lightbulbs support a wide range of colors, which can be controlled using an (RGB) color palette.
Our target for this research is the Philips Hue line of products, and more specifically, the Philips Hue Bridge. As a side note: the Hue line of products originated in the Philips-Lighting division of Philips and is now branded under a third company called Signify.
While “smart” lighting solutions aren’t that popular yet in Israel, we found this isn’t the case in many other countries. For instance, this article from 2018 states that Philips Hue dominates 31% percent of the smart lighting market share in the UK, used by over 430,000 households. In fact, when we presented our research results to some of the VPs in our company, they told us that all the lights in their house are from the Philips Hue brand.
The following graphic, taken from the original research paper, shows the network architecture for a home or office that uses this product:
Figure 2: ZLL (ZigBee Light Link) architecture.
ZLL is an acronym for ZigBee Light Link, which is a customization layer to the ZigBee network stack that focuses on light devices: both the lightbulbs and the bridge that controls them.
On the one hand, we have the ZigBee devices: lightbulbs, switch and the bridge. And on the other hand, we have the IP devices in the “regular” computer network: our mobile phone, a router and again, the bridge. As is inferred by his name, the bridge is the only device that is present in both networks, and its role is to translate the commands we send from the mobile app into ZigBee radio messages that are then sent to the lightbulbs.
We already knew that the bridge uses a MIPS CPU (that’s why we originally chose it), but it turns out that its architecture is even more complex. In Figure 3, we show the board of the bridge (2.0 model) after we extracted it from the plastic case:
Figure 3: The electric board of the bridge hardware model (2.0).
From this point on, we refer to the Atmel CPU as the modem. This is mainly because the main CPU offloads the handling of low level ZigBee network tasks to be performed solely on this processor. This means that both the physical layer and the NWK layer are handled by the modem, which in turn might query the main CPU for needed configuration values.
To our surprise, the main CPU runs a Linux kernel and not a real-time-operating-system. This turned out to be quite useful when we had to extract the firmware and debug the main process responsible for the core logic of the bridge.
On his website, Colin O’Flynn (@colinoflynn) describes how to connect to the exposed serial port and gain root privileges on the board. This is a great guide to anyone who deals with embedded Linux devices, and specifically deals with the U-Boot bootloader. Unfortunately, I didn’t have the necessary equipment to connect to the serial interface, which I discovered after I repeatedly failed to reproduce Colin’s results. Fortunately, I consulted my little brother who helped me out and told me which serial cables I needed to order. And so, we started reverse engineering the old firmware version (from 2016) I received from Eyal R. while I waited for the cables to arrive.
The core process in the main CPU is the ipbridge
process. A basic recon shows it is a classic case of an ELF target:
*.so
files), stack and heap are randomized by the operating system.This is a somewhat mixed state that we often see when dealing with targets running Linux. The operating system enables some security features by default, and usually the vendor doesn’t try to actively enable additional features such as PIE (Position Independent Executable) or even stack canaries. From our perspective as attackers, the exploitation won’t be easy as there is some ASLR (Address Space Layout Randomization) in place, but it is still possible because there are some fixed known memory addresses we can use in our exploit.
Before we started reverse engineering the process, we noticed that the disassembler had trouble distinguishing between Mips and Mips16 code sections (similar to the Arm and Thumb case in an ARM firmware). This was a good time to test if Thumbs Up, originally tested only on Intel and Arm binaries, also produces improved analysis in our Mips binary. Luckily for us, it worked quite well: initially we had 2525 functions, and after the execution we had a cleaner binary with 3478 marked functions. Now we started reverse engineering our binary without a need to manually improve IDA Pro’s analysis.
Immediately after we started the reverse engineering phase, we saw something odd. For some reason, it looks like we expect our messages to arrive in a textual form?!
Figure 4: Command strings to look for in the incoming messages.
In Figure 4, we can see the list of strings we expect to find in the incoming message. Each string routes our message to a specific handler function, such as the function we named EI_zcl_main_handler
. At this point, we checked the ZigBee specs again, as it made no sense. The protocol should be binary, and with a really low bandwidth, why does our program think it should receive long strings?
After reading the conclusions from Eyal R. and Colin once more, it suddenly came clear. The modem has an additional role that we initially ignored: it translates the binary messages to a textual representation, and then sends them through a USB-to-Serial interface. This way the main CPU reads the easy to handle textual messages from a serial device that is mapped as a file in the operating system.
Colin found evidence that the lightbulbs use the Atmel BitCloud SDK, which is now closed source and must be purchased from Atmel. Therefore, it makes sense to assume that the same software stack is also used as a “decoder” layer in the modem CPU in the bridge:
This way, the main CPU only needs to be familiar with logical aspects of the ZigBee stack, but doesn’t need to implement complicated decoding and parsing features that are already included in the stack that is shipped with the Atmel modem.
From a security perspective, this design choice has its pros and cons. As far as we are concerned, it has a massive implication. We only have the firmware for the ipbridge
process, which we can also debug using a remote gdbserver we compiled and placed on the bridge’s file system. The firmware for the modem is encrypted and it will not be easy to recreate the steps from the original research to extract this key (using a power analysis attack) and decrypt the modem’s firmware.
This means that we can only treat the modem as a black box that performs a lot of parsing, and maybe even holds a few state machines. We have a few hints from the partial code version that is found on GitHub (that is a few years old), but for all intents and purposes it is simply a black box that can block some of our attack attempts if they demand we send malformed messages.
Nothing about this research is going to be easy, and so, we just add this new obstacle to our list and continue on.
Now that we understood why the modem sends us textual messages into a serial device, we tracked down the flow of the messages between the different threads, and started looking for vulnerabilities in each of the different handlers. Our efforts focused on the ZCL handler, as it supports read/write operations on a wide variety of data type attributes:
E_ZCL_UINT32
(0x23) – 4 bytes integer.E_ZCL_UINT16
(0x21) – 2 bytes integer.E_ZCL_UINT8
(0x20) / E_ZCL_ENUM8
(0x30) – 1 byte integer / 1 byte enum value.E_ZCL_BOOL
(0x10) – 1 bit Boolean value (True/False).E_ZCL_ARRAY
(0x48) – variable length byte array, using a 1-byte length field.As you can probably understand, handling variable length fields in an embedded device is a sure recipe for vulnerabilities. Figure 5 shows the assembly code that handles this case:
Figure 5: Assembly snippet for the vulnerable handling of array data types.
Note: Bear in mind that the MIPS architecture uses a delay slot, so on the call to malloc()
, the value 0x2B is passed as an argument inside the delay slot in the instruction: li $a0, 0x2B
. This can be a bit confusing for anyone reading MIPS assembly for the first time.
What did we find? An attacker could send a malicious response for a READ_ATTRIBUTE
message, containing a malformed byte array that is bigger than the fixed size of 43 bytes (0x2B). This triggers a controlled heap-based buffer overflow, without any byte restrictions.
Possible limitations to this potential vulnerability:
This is not exactly the easiest vulnerability to exploit, but it’s a serious vulnerability nevertheless.
In an instance of good timing, our serial cables finally arrived and we immediately started checking if we had indeed found a vulnerability. We compiled a gdbserver and placed it on the bridge’s file system, and now encountered a new obstacle: we don’t have a transmitter with which to send our attack. After another consult with Eyal R., we bought the evaluation board of the lightbulb’s CPU, exactly as his team did in their research.
Meanwhile, we found a hack that allowed us to verify the existence of this vulnerability even without transmitting a radio message over the air (hoping that the modem won’t block us later on). The ipbridge
process supports a debug testing mode that is activated by connecting to two named pipes that the process listens on using a debug thread: /tmp/ipbridgeio_in
and /tmp/ipbridgeio_out
. While these debug capabilities aren’t really helpful, we patched the binary so that messages that arrive through these pipes are added to the message queue as if they arrived from the modem itself.
Using this small binary patch, we were able to create our own process that connects to the named pipes and sends (textual) messages aiming to hit the vulnerable code function. After some trial and error, and using our debugger, we were able to trigger the vulnerability and prove it exists. The only caveat is that the modem can still block it, and this requires us to transmit the attack over radio.
While waiting for our transmitter, our full Philips Hue starter kit arrived with a brand new 2.1 model bridge and 3 lightbulbs. This looked like the right time to extract the new firmware from the bridge, together with updating the 2.0 bridge to the latest firmware. After all, up until now we worked on firmware from 2016, and things might have changed in the meantime.
Sadly, things did indeed change.
The first thing we noticed about the new firmware is its size. For some reason, the ipbridge
ELF file grew from 1221KB to 3227KB. Opening it in IDA Pro showed us the main difference: the binary was (accidentally?) shipped with debug symbols. This is great news that can really help us in our reverse engineering attempts. Figure 6 shows some of these symbols:
Figure 6: Function symbols of the new firmware.
Using this new discovery, we learned that our initial reverse engineering was relatively accurate, and the name of the vulnerable function turned out to be: SMARTLINK_UTILS_ReadAttributeValue
.
When analyzing the vulnerable function in the new firmware version, we had an unpleasant surprise. The list of supported data types was updated, and now the vendor supports character strings (0x42) instead of byte arrays (0x48). Although strings are still variable in length, the allocation now changed to be more appropriate to null terminated strings:
L
) is read from the incoming message.L + 1
is allocated.A fixed heap buffer is no longer used, and this change of supported data types just closed our vulnerability. Time to search for a new one.
We put the ZCL module aside and eventually found our way to the ZDP module, more specifically, to the handler of incoming LQI (Link Quality Indicator) management responses. These messages are part of a module that is responsible for neighbor discovery. Periodically, the bridge queries the lightbulbs for their known neighbors in the ZigBee network. While the name suggests that the messages are focused on the quality of the radio transmission, the message structure is actually focused on the full set of network addresses for each neighbor.
The context for each neighbor, as seen in these messages:
As both parties need to tell each other about a variable number of neighbors, which can include up to 0x41 supported records in the ipbridge
global neighbor array, these messages include a fragmentation format. In each response, the lightbulb tells the bridge that it is currently answering with L
records, from offset X
to offset X + L - 1
, out of possible S
records.
As you may recall, the message sizes in the ZigBee stack are quite small, so using so many indices in each message, and sending multiple records of 16 bytes each, really limits the number of records that can be included in each message. As a result, the developers store the incoming records on the stack in an array that can hold up to 6 records. However, there are no checks in place to make sure that the incoming length field is indeed small enough, leading to a potential stack-based buffer overflow.
You might wonder how we are planning to transmit such a “huge” message and overflow the buffer. Due to the physical limitation on the message sizes over radio, our only hope is to find a vulnerability in the modem, and then use this stack-based overflow to hop from the modem and into the main CPU. This means that even if we just found a vulnerability, it could only be exploited using an additional vulnerability in an additional CPU for which we don’t even have the firmware. Not exactly a great plan, but in the absence of anything else…
Before starting such a daring move, we once again used our hack to inject packets, and tried to trigger a controlled stack-based buffer overflow to check the exploitability of this new vulnerability. Unfortunately, the return address on the stack lies exactly in an offset that we don’t fully control when overflowing. Our overflow occurs by parsing incoming fields and placing them in a local struct. It turns out that we can only overflow the return address with the value 0x00000004
.
Verdict: Not exploitable. At least this saved us the need to try and look for vulnerabilities in the modem.
Figure 7: Missing check in LQI message handling, together with the verdict – not exploitable.
Side note: The maximal number of records that is allowed in the BitCloud SDK is 2. The ZigBee protocol uses multiple indices in a fragmentation message that can only carry up to 2 records in each message. It’s not exactly efficient, to say the least.
Happily enough, 3 turned out to be our lucky number. After we finished auditing the code for all of the different message handlers, we had an intriguing question: When we send ZCL attributes, who handles them after the initial (no longer vulnerable) parsing?
While trying to answer this question, we found a new thread named applproc
. This thread reads the structure that includes our parsed attribute, checks an unknown state-machine check, and if we are fortunate, delivers our message to the CONFIGURE_DEVICES_ReceivedAttribute
function. Figure 8 shows the assembly of this function:
Figure 8: Assembly of function CONFIGURE_DEVICES_ReceivedAttribute
.
For some unknown reason, an opcode is extracted from the incoming struct:
strdup()
.When we went back to check how this structure is initialized, we saw this snippet:
Figure 9: Using the value 0x10 when handling an incoming string, thus creating a type mismatch.
It looks like the transition from supporting arrays to strings was done only halfway, as the string is marked by mistake as “array” using the constant 0x10 instead of 0x0F. This means that once again we have a heap-based buffer overflow vulnerability, and we were able to trigger it using our hack alongside a slight modification to our previous PoC.
Now that we have a vulnerability, one that still depends on an unknown state machine check we need to pass, it is a good time to unpack the arrived transmitter and try to trigger the vulnerability over the air. In the next chapters, we describe the exploitation process for this vulnerability, together with the ZigBee obstacles we discovered and overcame in the process.
It is important to note that we specifically chose the ATMEGA256RFR2-XPRO evaluation board for multiple reasons:
Surprisingly enough, the first point turned out to be a crucial one, but we discuss this part later on.
You might expect that when you buy an Atmel product that comes with a Visual Studio based IDE called “Atmel Studio”, it’s easy to create a sample ZigBee project that simply sniffs messages and prints them to the output/serial. Sadly, this wasn’t the case. After some Googling, we found that Atmel provides a series of useful YouTube tutorials, like this one, in which a man sailing on a boat (we’re not kidding) tells you how to use the extension manager and download a package that allows you to create sample Wireless projects. This is exactly what we initially looked for.
Now that we were able to sniff some messages, we paired a lightbulb with our bridge (a process called “commissioning”), and printed the messages to the serial output. At this point, we realized that while we now have some recorded messages, we don’t really have a proper way to parse them into a human-readable format. We tried a variety of open source Python scripts for ZigBee, but none of them were really useful. We did manage, however, to load the hex-dumped packets into Wireshark, using the following encapsulation type shown in Figure 10:
Figure 10: Wireshark encapsulation type for IEEE 802.15.4 (Zigbee) messages.
Important note: Wireshark fails to analyze the messages if they have an invalid FCS (Frame Check Sequence) field. When we transmit messages, this field is automatically calculated and added by the antenna. Therefore, we recommend that you drop this field from incoming messages, and pick in advance the encapsulation type that tells Wireshark that the FCS field is not present. This makes it easier to analyze dumps of incoming and outgoing messages.
Even a short glance at the dumped conversation taught us that a few things are missing:
As we mentioned earlier, the decision to use our specific evaluation board proved to be crucial. The protocol transmits messages so quickly that the baud rate of the serial interface causes critical delays. In short, in the time we print the messages / send them to our PC, we miss important messages from the ongoing conversation. We have to implement the entire exploit on the evaluation board (in C) if we want to have even the slightest chance of keeping up with the fast pace of the messages in the ZigBee protocol.
Meanwhile, we buffered the messages on the board itself, and sent them to the PC only when our buffer was full. This enabled us to record most of the messages. However, we still missed a few when both the lightbulb and the bridge transmitted together during short periods of time.
Wireshark supports the option to decrypt the ZigBee messages and analyze their decrypted payload, but you must supply it with the proper key. This was a good time to read about the protocol and learn how its crypto design works.
In short, each device uses two important keys:
The vast majority of messages should be only encrypted by the network key, and the transport key is only used when distributing the network key to a lightbulb during the commissioning phase. Which brings us to the immediate problem: we need the transport key, otherwise Wireshark won’t tell us what our network key is.
Figure 11 shows the sample ZigBee .pcap
file from Wireshark’s website, and the Transport Key
message is highlighted:
Figure 11: Sample ZigBee recording, the Transport Key
is encrypted and shown as: APS: Command
We can see that since we don’t have the transport key, we can’t decrypt the Transport Key
message, and it is merely shown as a generic APS command
.
Although we found multiple keys when researching the topic, none of them worked. It seems we were not the first to tackle this issue, as eventually we reached this blog post in which the author details the solution to the problem. It turns out that the “regular” keys are used in a “touchlink commissioning”, but our “classic commissioning” uses a different secret key. Fortunately, both of the keys are included in the article, and they indeed worked. This time we managed to successfully decrypt the message inside Wireshark. Figure 12 shows the decrypted message:
Figure 12: A decrypted Transport Key
message, containing the Network Key.
Note: We deliberately chose to include the actual network key in this image. Later on we also include a link to a full .pcap
recording of the entire commissioning conversation with our model bridge.
When implementing the crypto layers on our evaluation board, we relied on the excellent implementation from Wireshark’s ZigBee dissector, found on GitHub.
After we found the network key with which the lightbulbs and the bridge encrypt all of their messages, we can try to craft our own hostile ZCL message and check if it triggers our breakpoint in the vulnerable function. After a few rounds of trial and error, we had some good news and bad news:
The check is shown in Figure 13:
Figure 13: Some state machine check that blocks our attack.
Initially, it looked like we might need a minimum number of 2 or 3 lightbulbs in the network, but this didn’t work either. After diving back into the code, we learned that function checks if the lightbulb that sent the message is currently undergoing a commissioning process.
Conclusion #1: The vulnerable function is only reachable when commissioning a new lightbulb into the ZigBee network. Legitimate participants in the ZigBee network can’t trigger the vulnerability we wish to exploit.
“Classic Commissioning” is the process of pairing (commissioning) a new lightbulb into our ZigBee network using the standard mobile app. In our case, we used the Philips Hue app from the Android Play Store.
Surprisingly, while there are many documents and specs that describe the messages in the ZigBee protocols, we failed to find a proper document that describes the flow of messages during the commissioning process. Therefore, we merged two approaches, and hoped that eventually we would implement enough messages and convince our mobile app that the bridge really discovered a “new” lightbulb. The approaches are:
Conclusion #2: The bridge won’t accept new lightbulbs into the network, unless the user actively ordered it to search for new lightbulbs. This is a good design choice that significantly lowers the attack surface on the bridge itself. In our attack scenario, we have to somehow trick the user into pressing this button in his app.
This also means that before each experiment, we had to press the button in the app (giving us a grace period of 1-2 minutes), and only then execute our program from the evaluation board. This was the case when we were trying to learn about the messages in the commissioning phase, and also when testing the exploit. Not exactly a smooth automatic procedure, but it worked eventually.
As we promised earlier, here is a link to a full .pcap recording of the classic commissioning with our model bridge, up to the point where the mobile app notifies us about a new lightbulb. The messages are stripped of their FCS field, and the pcap doesn’t contain the IEEE 802.15.4 Ack
messages, as they are sent as an acknowledgment after almost every message.
Implementation Note: There are multiple strict timing restrictions in the ZigBee protocol and in the bridge’s modem, making the entire conversation extremely unreliable if not timed correctly. This means that we must acknowledge incoming messages very quickly, a restriction that was impossible in our custom implementation of the ZigBee network stack. Therefore, we configured our evaluation board to automatically acknowledge incoming messages in its MAC layer. This change had a significant downside: it means that we can no longer sniff messages in promiscuous mode.
Figure 14 shows our crafted new lightbulb, as it’s shown in the app if the user requests full details.
Figure 14: Our crafted lightbulb, as seen in the Philips Hue mobile app.
As you can see, there are multiple controllable string fields that are exchanged during the commissioning phase. We chose to label our new lightbulb as a brand new Check Point Research lightbulb, model “CPR123”.
The commissioning phase can be divided into 4 main parts:
Device Announce
message.READ_ATTRIBUTE
requests to learn about the specs of the lightbulb.Only during the ZCL Phase can we start sending our malicious ZCL messages, in an attempt to trigger the heap-based buffer overflow that we found earlier. We can send malicious response messages regardless of the actual requests that are issued by the bridge, but we can only start sending them after we reach this phase in the commissioning process.
We decided to tackle our problems one at a time. Our first goal was to succeed in exploiting the heap-based vulnerability and jump to an arbitrary memory address, and later on discover where to jump. This turned out to be the wrong decision, as the heap varied a lot based on the messages with which we placed the shellcode in the target’s memory.
The first thing to do when exploiting a heap-based buffer overflow is to check which heap implementation is used by the target. In our case, the target uses uClibc, which stands for “micro LibC”. The exact version was clearly listed in the library’s file name: libuClibc-1.0.14.so
. With a few different heap implementations that are supported by this library to choose from, we easily spotted in the binary the use of the “malloc-standard” implementation, which is based on dlmalloc.
For a small LibC implementation whose prime target consists of products with constrained memory and CPU resources, the implementation is quite straightforward:
Fast-Bins
store small-sized free buffers using a singly-linked list.For a “standard” dlmalloc
implementation, this is the meta-data used by this heap implementation:
Figure 15: The malloc_chunk
structure used in our heap implementation.
Notes:
Fast-Bin
, the third field is also used, and points to the next node in the Fast-Bin
’s linked list. This field is located at the first 4 bytes of the user’s buffer.Fast-Bin
, both the third and fourth fields are used as part of the doubly-linked list. These fields are located at the first 8 bytes of the user’s buffer.Doubly-linked lists sometimes offer a great exploit primitive, as during the list unlinking a corrupt node can trigger a Write-What-Where operation. However, we are no longer in the early 2000s, and this primitive isn’t going to work for us in this popular heap implementation. Instead, the developers deployed a protection mechanism known as “Safe Unlinking” which verifies the “forward” and “backward” pointers before using them.
Figure 16: The unlink
macro from uClibc
, using a “Safe Unlinking” approach.
Due to this security mitigation, we decided instead to attack the Fast-Bins
. These bins consist of singly-linked lists, meaning that they can’t be properly verified like the previous doubly-linked lists.
The Fast-Bins
are an array of various sized “bins”, each holding a singly-linked list of chunks of up to a given size. The minimal bin size contains buffers of up to 0x10 bytes, the next holds buffers of 0x11 to 0x18 bytes, and so on. During our study, we found an interesting bug in the implementation of the free()
method:
Figure 17: Implementation of the fastbin_index()
macro.
Relying on the fact that the smallest allocation size should be 0x10, the fastbin_index()
macro divides the size by 8, subtracts 2 from it, and uses the result as the index to the `Fast-Bin` array. If we can corrupt the metadata record of a given freed chunk, we can change this index to be one of the invalid values: -1 or -2.
Figure 18: Surroundings of the fastbins
array in the global malloc_state
.
Using the invalid value of -1 stores our freed buffer in the max_fast
field, which is responsible for the configurable maximum size for a fast bin. Storing a pointer in this field will probably wreak havoc, but what about the invalid -2 index?
Using a debugger, we saw that nothing is stored before the malloc_state
global struct, meaning that storing a pointer at fastbins[-2]
won’t ruin anything important. In addition, malloc()
won’t think of checking this invalid Fast-Bin
for any allocation to be returned to the user. For any practical use, we just created the /dev/null
bin, giving us a primitive to leak memory from the heap, a primitive that can help us shape it to our desired state.
Our vulnerability gives us a controllable heap-based buffer overflow from a buffer of 0x2B bytes and up to roughly 70 bytes. Due to basic alignments in the heap, we most probably get a buffer of size 0x30 (we get a larger buffer only if we run out of suitable buffers). In addition, there is some weird quirk in the heap’s implementation:
malloc_chunk
.prev_size
field of the malloc_chunk
that is positioned after us in memory.This bizarre implementation probably saved someone 4 bytes per malloc
chunk, but it sure didn’t make the code easier to read or debug.
Having all of these details in mind, our master plan is to overflow an adjacent free buffer that is located in a Fast-Bin
. Figure 19 shows how the buffers look before our overflow:
Figure 19: Our controlled buffer (in blue) placed before a freed buffer (in purple).
Figure 20 shows the same two buffers after our overflow:
Figure 20: Our overflow modified the size
and ptr
fields of the freed buffer (shown in red).
Using our overflow, we plan to modify the size
of the adjacent buffer to 1. As the size is always divisible by 4, the two least significant bits store the prev_inuse
and the is_mmaped
flag bits. In practice, we told the heap that our buffer is still in use, and that the size of the adjacent (free) buffer is zero.
We also modify what we hope is the single-linked pointer of a Fast-Bin
record. By changing this pointer to our own arbitrary address, we can trigger the heap into thinking that a new freed chunk is now stored there. By triggering a sequence of allocations of the size which matches that of the relevant Fast-Bin
, we can gain a Malloc-Where primitive, with which we plan to gain our code execution.
Here is a short description of the different scenarios we might encounter during our overflow:
Fast-Bin
. Success, as we corrupted the Fast-Bin
pointer and can use it to gain the desired Malloc-Where primitive.free()
ed. Partial Success, as the free()
operation will use our corrupt size
and store it in the /dev/null
bin.free()
ed. Not so bad. In this case, we hope we didn’t ruin anything, as we only modified 4 bytes in that buffer.We can only really lose in 1 of the 4 scenarios, and in the rest of them we either directly win, or advance towards winning. Let’s hope that the odds are in our favor, and try to overflow the least number of times needed for a successful exploit.
Special Note: After we finished this research, we devised a security mitigation called “Safe Linking” that protects the single-linked-lists in the heap from exploits like the one we have just described. This feature is already integrated into the latest versions of uClibc-NG
, and glibc
. For more info, here is our blog post on “Safe Linking”.
The most important part of shaping the heap in the form shown above is that the main CPU is quite weak. If we send many messages, and we send them fast enough, we actually starve some of the threads in the target program. This means that during our attack, the threads in our data flow are practically the only threads that are scheduled for execution. This important behavior drastically improves our success rate, as it reduces the noise in the heap to a minimal level.
Equipped with this important discovery, and knowing that we overflow the heap from an allocation of malloc-size 0x30 bytes, we devised a simple plan:
The first phase is the slowest one, as we want the buffers to gradually be freed before we start our overflow. Again, we aim to overflow directly into a freed buffer.
In the second phase, we hope to modify a Fast-Bin pointer to point directly at the address of the pointer to free()
in the GOT. This way, the third phase sends messages and one of them is stored in the GOT, as the heap mistakenly think it is a free heap buffer. This Malloc-Where primitive now turns into a fully controlled network packet that is written to an arbitrary memory address, a very strong exploit primitive. And the trigger itself is immediate; a call to free()
one of our messages jumps to execute our shellcode.
As the majority of the allocations use the heap (which is randomized by the Linux operating system), the task of locating a controllable address in which we can store our shellcode turned out to be a relatively complex one. In addition, as the modem sends short textual messages to the main CPU, we don’t have any global buffer that can store a long binary content of our choosing.
Eventually, we came to the conclusion that we can’t be picky and we must use the only global array that we’ve seen that is large enough: the array in which the bridge stores the incoming (LQI) neighbor records. This array has its pros and cons:
Pros:
Cons:
Later on, we also learned that we can’t even use the entire capacity of 0x41 records. However, when you don’t have a lot of options, you can’t afford to be picky.
The restrictions on each neighbor record:
On top of that, we don’t really have 10 adjacent controlled bytes, as the bridge checks that each record is unique. Each “extended network address” must be unique, which is easy due to its size of 8 bytes. Each “short network address” must also be unique, which is a totally different story, as we must be really creative to work around this restriction and make use of as many bytes as we can.
The proper way of delivering our “neighbor records” to the bridge is through the LQI (Link Quality Indicator) management messages. However, this time both the modem and the main CPU keep track of a proper state machine, and we can only send our messages as proper answers to requests that originate from the bridge itself. Unfortunately, the bridge only issues these requests after we finish the ZCL Phase, meaning that we can only prepare the shellcode in the target’s memory after the opportunity window for the exploitation is closed.
At this point, we examined the content of the memory array and saw that our network addresses are also stored here, although we’ve yet to send any LQI message. Further examination revealed that the DEVICE_ANNOUNCE
message we transmit during the Acceptance Phase also adds a single record to the neighbor array. This effectively means that it’s an address book array, and not a “neighbor array.”
Figure 21: Transmitting dummy DEVICE_ANNOUNCE
messages that will later on contain our shellcode.
This is where things started to get messy. For each new address, the bridge sends a matching Route Request
in an attempt to learn how to reach that new ZigBee node. These transmissions caused the bridge to be quite unstable, and affected the already shaky timeouts of the rest of the protocol’s state machines. Our solution to this new problem was to use multiple lightbulbs:
As only the first lightbulb can successfully complete the commissioning phase, the user has no clue that the bridge saw additional phantom lightbulbs.
If we use Mips16 assembly instructions, most of our instructions cost us 2 bytes each, and the more complex instructions cost us 4 bytes each. Ideally, we can use the first 8 bytes to perform a few assembly instructions, and use the last 2 bytes to jump ahead into the next record. This is where the uniqueness restrictions hit us hard. Most of the time we jump ahead 6 bytes (to the next record), and this means that the jump/branch instruction are the same each time and violates the restriction. However, we can use various jump instructions if for example we have conditional jumps.
The plan for our shellcode was to modify the original ELF and create a backdoor. Our heap modifications will most probably cause the process to be unstable, to say the least. If we modify the ELF itself, then after the process crashes, a software daemon (watchdog) restarts it, and this time it contains our embedded backdoor.
This plan was good on paper, but both the path to the ELF file and the backdoor itself were too big for our limitation of up to 10 consecutive controlled bytes in each record. The idea we came up with was a simple decoder loop:
While the shellcode worked OK in our dummy environment, it encountered a few obstacles when we tried it on our real target.
First, the shellcode was expensive: it cost us 0x19 records; we originally sent only 0x10 records when testing the exploit for the heap-based vulnerability. This mere addition of only 9 records turned out to be too much: the bridge was too unstable and our third lightbulb failed to reach the ZCL Phase.
After a lot of calculations, we managed to squeeze our nice configurable shellcode into 0x12 records of a not-so-readable shellcode. We successfully bypassed this size limitation, and managed to start debugging our shellcode on the real target (using our remote gdbserver).
This is where we found the flaws in our initial plan. A decoder loop in a Mips architecture mandates that we call sleep()
so that we won’t have any cache issues. Otherwise, our re-arranged records won’t propagate (be flushed) to the processors Instruction Cache (I-Cache), and effectively it executes some random garbage instead of our full shellcode. This sleep meant that we pretty much destroyed the target’s heap, and while we had our beauty sleep other threads were left to deal with our mess and crashed.
We couldn’t afford to enlarge our shellcode in our attempts to restore the program’s flow and avoid crashes, and it turns out that the ELF file wasn’t even writable during execution, so we had to devise a new plan for our shellcode.
If we already need to restore the execution flow so that the program won’t crash during our sleep()
, we might as well fully restore it and install a backdoor in our own memory address space. This way we don’t have to write to any file, and the lack of file path may remove the original need for an expensive decoder loop.
We went back to the drawing board, and after a few days managed to write a new shellcode that performs the following set of tasks (in order):
mprotect()
a specific memory page to RWX permissions, and modify the needed bytes to incorporate our backdoor in the right place.The second point was quite interesting, as it turned out that simply sending too many messages at a fast pace caused some threads to starve. When we finished, the watchdog saw that those threads were unresponsive, and exited the program alongside a nice syslog message that was sent to the vendor. This is probably the proper time to apologize to the vendors who might now think that something is wrong with one of their products, as it consistently sent them dozens of syslog reports.
Eventually, after some debugging, we had a working shellcode of 0x10 records. Figure 22 shows the memory layout of our shellcode, as is shown in IDA:
Figure 22: Memory layout of the final shellcode, as seen in IDA.
As you can see, the initial records hold the code to be executed, and the last 3 records store configuration variables including the data for the installed backdoor. Each code record executes a few assembly instructions and jumps ahead to the next record, until we finish all of our tasks and return to the original execution flow of the program.
We are not going to dive too deeply into the technical aspects of our backdoor, as we are not releasing a fully weaponized exploit to the public. What we can share is that our backdoor gave us a Write-What-Where primitive using a specially crafted message that we can now send to the target bridge from our “legitimate” lightbulb. We used this stable write primitive to write Scout’s loader to a RWX memory cave, and then used the fact that the code is still writable to redirect the execution to our new shellcode.
Scout’s loader simply connected back, over TCP, to our servers, received an executable to be dropped and deployed on the bridge, and executed it. In Figure 23 we can see the dropped /tmp/exploit
process that executes the next stage of our attack.
Figure 23: Process list from the bridge, showing our malware is executed as root.
Using our brand new Mips target, we were able to extend Scout to support the Mips architecture, and it worked like a charm in our test case.
In our attack scenario, we want to take control of the bridge from the ZigBee network, and use it as a leverage point to attack additional computers in the IP network. But first, our vulnerability mandates that we trick the user into searching for new lightbulbs, which is not exactly an easy step. Using the attack primitives from the original research, we devised the following plan:
For our demonstration, we chose to use the leaked NSA EternalBlue
exploit, just as we did in our FAX research. The exploit is executed from the bridge itself, and is used to attack unpatched computers inside the target’s IP network.
In the second part of the YouTube video, you can see an exploitation attempt on the same vulnerable Hue Bridge, when this time we installed on it our IoT nano agent. This nano-agent enforces Control-Flow-Integrity (CFI) and adds on-device protection to the firmware itself, thus successfully identifying and blocking our attack, even without familiarity with the exact 0-Day that we’ve exploited.
Check Point provides a consolidated security solution that hardens and protects the firmware of IoT devices. Utilizing a recently acquired technology, Check Point allows organizations to mitigate device level attacks before devices are compromised utilizing on-device run time protection. In addition to device-level security, Check Point offers network-level IoT protection by monitoring IoT traffic, identifying malicious communications or access attempts, and blocking them.
This research was done with the help of the Check Point Institute for Information Security (CPIIS) at Tel Aviv University. And on a more personal note, with the help of an old colleague: Eyal Ronen (@eyalr0).