Don’t be silly – it’s only a lightbulb

August 7, 2020

Research by: Eyal Itkin

Background

Everyone is familiar with the concept of IoT, the Internet of Things, but how many have heard of smart lightbulbs? You can control the light in your house, and even calibrate the color of each lightbulb, just by using a mobile app or your digital home assistant. The smart lightbulb management is done over WiFi or even ZigBee, a low bandwidth radio protocol.

A few years ago, a team of academic researchers showed how they can take over and control smart lightbulbs, and how this in turn allows them to create a chain reaction that can spread throughout a modern city. Their research brought up an interesting question: aside from triggering a blackout (and maybe a few epileptic seizures), could these lightbulbs pose a serious risk to our network security? Could attackers somehow bridge the gap between the physical IoT network (the lightbulbs) and even more appealing targets, such as the computer network in our homes, offices or even our smart cities?

We’re here to tell you the answer is: Yes.

Continuing from where the previous research left off, we go right to the core: the smart hub that acts as a bridge between the IP network and the ZigBee network. By masquerading as a legitimate ZigBee lightbulb, we were able to exploit vulnerabilities we found in the bridge, which enabled us to infiltrate the lucrative IP network using a remote over-the-air ZigBee exploit.

Below is a Video Demonstration of this attack:

This research was done with the help of the Check Point Institute for Information Security (CPIIS) at Tel Aviv University.

Introduction

After we finished our previous research (Say Cheese: How I Ransomwared Your DSLR Camera) we decided to extend our debugger (Scout) to support additional architectures such as MIPS. As the best way to do so is to start researching MIPS, I asked on Twitter for suggestions of a good MIPS target for vulnerability research.

As is mostly the case, people responded with a few promising leads, and the most promising one was from an old colleague of mine: Eyal Ronen (@eyalr0), who is now in a research position at the CPIIS (Small world, isn’t it?). Eyal Ronen suggested I continue his research on smart lightbulbs (See “Prior Work” in the next section). In their original research, his group was only able to take control of the lightbulbs themselves. He believed it might be possible to leverage this position in the ZigBee network to deploy an attack against the bridge that connects the ZigBee network to the IP network. In essence, this new attack vector enables an attacker to infiltrate the IP network from the ZigBee network, using an over-the-air attack.

Prior Work

In IoT Goes Nuclear: Creating a ZigBee Chain Reaction, a team of researchers led by Eyal Ronen (@eyalr0), Colin O’Flynn (@colinoflynn) and Adi Shamir, analyzed the security aspects of ZigBee smart lightbulbs. More specifically, they focused on the Philips Hue bridge and lightbulbs, showing a series of exploits:

  • Attackers can remotely “steal” a lightbulb from a given ZigBee network, and force it to join their network (demonstrated using a drone war-flying from 400 meters): https://www.youtube.com/watch?v=Ed1OjAuRARU
  • Due to an implementation flaw, even a regular lightbulb can be used to deploy this type of attack and “steal” lightbulbs from adjacent ZigBee networks.
  • Attackers that share the same ZigBee network with a target lightbulb can send a malicious firmware update to the lightbulb, thus taking complete control over it.

By combining these 3 demonstrated attacks, the researchers argued that by taking control of a chosen subset of lightbulbs in a smart city, they could trigger a nuclear-like chain reaction that could eventually take control of all the lightbulbs in the city.

Due to the nature of the attacks, the vendor was only able to block the second attack, thus leaving us with the capabilities to:

  1. “Steal” a lightbulb from a given ZigBee network in close proximity (400 meters).
  2. Update the firmware of that lightbulb, and use it to launch the next phase of our attack.

After receiving a detailed explanation of their original research, and armed with a Philips Hue Bridge that Eyal R. managed to salvage from their lab, we were ready to begin this promising new research.

ZigBee 101

According to Wikipedia, “ZigBee is an IEEE 802.15.4-based specification for a suite of high-level communication protocols used to create … low-power, low data rate, and close proximity wireless ad hoc networks.” IEEE 802.15.4, not to be confused with IEEE 802.11 (WiFi), is the technical standard for the radio-based network protocol that acts as layers 1-2 (in OSI model terms) of the ZigBee network stack.

Just to get a sense of this low data-rate protocol, the maximum transmission unit (MTU) for a frame in the underlying MAC layer of IEEE 802.15.4 is 127 bytes. This means that unless fragmentation is used, the messages of the ZigBee network stack are very limited in size. Hopefully, this limitation won’t restrict us too much in finding, and later on exploiting, vulnerabilities in the ZigBee implementation.

On top of the narrow radio network layer, ZigBee defines a full stack of network layers, as can be seen in this figure taken from (an older version of) the ZigBee specs:

Figure 1: ZigBee network stack outline.

In short, we can roughly divide the network stack into 4 layers (in ascending order):

  1. Physical / MAC layer – Radio-based frames defined by IEEE 802.15.4.
  2. Network Layer (NWK) – Responsible for routing, relaying and security (encryption).
  3. Application Support Sublayer (APS) – Routes the message to the correct upper application.
  4. Application Layer (ZDP / ZCL / etc.) – The logical applicative layer, depending on the incoming message (multiple layers are present at the same time).

ZDP = ZigBee Device Profile

ZCL = ZigBee Cluster Library

For those of you who are familiar with the SNMP protocol, ZCL looks like a different encoding of the same logical interface. The ZCL layer allows devices to query (READ_ATTRIBUTE) and set (WRITE_ATTRIBUTE) a collection of configuration values (clusters), which ultimately allows the operator (the bridge) to control the lightbulbs. For example, attributes for the Color Control cluster include:

  • Color Temperature Physical Min/Max
  • Color Point Red/Green/Blue

This example also shows that these are not ordinary white/yellow lightbulbs. These smart lightbulbs support a wide range of colors, which can be controlled using an (RGB) color palette.
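To make the attribute interface more concrete, here is a minimal sketch of how a READ_ATTRIBUTE response could be serialized. It follows the general ZCL frame layout (frame control, sequence number, command identifier, then attribute records). The helper names are hypothetical, and the attribute identifier 0x400B used for "Color Temperature Physical Min" in the usage note is an assumption that should be checked against the ZCL specification.

```python
import struct

# Data type identifiers from the ZCL specification.
ZCL_UINT8 = 0x20
ZCL_UINT16 = 0x21
ZCL_CHAR_STRING = 0x42

READ_ATTRIBUTES_RESPONSE = 0x01  # ZCL global command identifier
STATUS_SUCCESS = 0x00
# Global command, server-to-client direction, default response disabled.
FRAME_CONTROL = 0x18


def encode_value(data_type: int, value) -> bytes:
    """Encode one attribute value in ZCL's little-endian wire format."""
    if data_type == ZCL_UINT8:
        return struct.pack("<B", value)
    if data_type == ZCL_UINT16:
        return struct.pack("<H", value)
    if data_type == ZCL_CHAR_STRING:
        raw = value.encode("ascii")
        if len(raw) > 0xFE:
            raise ValueError("string too long for a 1-byte length prefix")
        return bytes([len(raw)]) + raw  # 1-byte length, then the characters
    raise ValueError(f"unsupported data type 0x{data_type:02x}")


def read_attributes_response(seq: int, attr_id: int, data_type: int, value) -> bytes:
    """Build a ZCL payload carrying a single successful attribute record."""
    header = bytes([FRAME_CONTROL, seq, READ_ATTRIBUTES_RESPONSE])
    record = struct.pack("<HBB", attr_id, STATUS_SUCCESS, data_type)
    return header + record + encode_value(data_type, value)
```

For example, `read_attributes_response(0x2A, 0x400B, ZCL_UINT16, 153)` yields a 9-byte payload. This payload still has to be wrapped by the APS, NWK and MAC layers before it goes on the air.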

Meet Our Target

Our target for this research is the Philips Hue line of products, and more specifically, the Philips Hue Bridge. As a side note: the Hue line of products originated in the Philips-Lighting division of Philips and is now branded under a separate company called Signify.

While “smart” lighting solutions aren’t that popular yet in Israel, we found this isn’t the case in many other countries. For instance, this article from 2018 states that Philips Hue holds 31% of the smart lighting market in the UK, used by over 430,000 households. In fact, when we presented our research results to some of the VPs in our company, they told us that all the lights in their house are from the Philips Hue brand.

The following graphic, taken from the original research paper, shows the network architecture for a home or office that uses this product:

Figure 2: ZLL (ZigBee Light Link) architecture.

ZLL is an acronym for ZigBee Light Link, which is a customization layer to the ZigBee network stack that focuses on light devices: both the lightbulbs and the bridge that controls them.

On the one hand, we have the ZigBee devices: lightbulbs, a switch and the bridge. And on the other hand, we have the IP devices in the “regular” computer network: our mobile phone, a router and again, the bridge. As its name implies, the bridge is the only device that is present in both networks, and its role is to translate the commands we send from the mobile app into ZigBee radio messages that are then sent to the lightbulbs.

Bridge Architecture

We already knew that the bridge uses a MIPS CPU (that’s why we originally chose it), but it turns out that its architecture is even more complex. In Figure 3, we show the board of the bridge (2.0 model) after we extracted it from the plastic case:

Figure 3: The electric board of the bridge hardware model (2.0).

  • On the right, marked in blue, is the main CPU: QCA4531-BL3A.
  • On the left, marked in red, is an Atmel CPU (ATSAMR21E18E) that implements the lower layers of the ZigBee stack.

From this point on, we refer to the Atmel CPU as the modem. This is mainly because the main CPU offloads the handling of low level ZigBee network tasks to be performed solely on this processor. This means that both the physical layer and the NWK layer are handled by the modem, which in turn might query the main CPU for needed configuration values.

To our surprise, the main CPU runs a Linux kernel and not a real-time operating system. This turned out to be quite useful when we had to extract the firmware and debug the main process responsible for the core logic of the bridge.

On his website, Colin O’Flynn (@colinoflynn) describes how to connect to the exposed serial port and gain root privileges on the board. This is a great guide for anyone who deals with embedded Linux devices, and specifically with the U-Boot bootloader. Unfortunately, I didn’t have the necessary equipment to connect to the serial interface, which I discovered after I repeatedly failed to reproduce Colin’s results. Fortunately, I consulted my little brother who helped me out and told me which serial cables I needed to order. And so, we started reverse engineering the old firmware version (from 2016) I received from Eyal R. while I waited for the cables to arrive.

ipbridge

The core process in the main CPU is the ipbridge process. A basic recon shows it is a classic case of an ELF target:

  • The code itself is compiled to a static memory address.
  • The libraries (*.so files), stack and heap are randomized by the operating system.
  • There are no stack canaries in place.
  • The process runs as “root” – which is great news.

This is a somewhat mixed state that we often see when dealing with targets running Linux. The operating system enables some security features by default, and usually the vendor doesn’t try to actively enable additional features such as PIE (Position Independent Executable) or even stack canaries. From our perspective as attackers, the exploitation won’t be easy as there is some ASLR (Address Space Layout Randomization) in place, but it is still possible because there are some fixed known memory addresses we can use in our exploit.
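Two of the recon items above can be checked with a few lines of scripting. The sketch below is only an illustration, not the tooling used in this research: it reads the `e_type` field of the ELF header to tell a fixed-address executable (ET_EXEC) from a position-independent one (ET_DYN), and uses the presence of the `__stack_chk_fail` symbol name as a rough heuristic for stack canaries. The function names are hypothetical.

```python
import struct

ET_EXEC = 2  # executable linked to a fixed load address
ET_DYN = 3   # shared object or position-independent executable


def elf_type(header: bytes) -> int:
    """Return e_type from the first bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (offset 5): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type is the 16-bit field right after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(endian + "H", header, 16)
    return e_type


def is_pie(header: bytes) -> bool:
    """True when the binary can be relocated as a whole (PIE)."""
    return elf_type(header) == ET_DYN


def has_stack_canaries(image: bytes) -> bool:
    """Heuristic: binaries built with stack protection reference this symbol."""
    return b"__stack_chk_fail" in image
```

A typical use would be to read the whole binary with `open(path, "rb").read()` and pass it to both helpers. A result of "not PIE, no canaries" matches the state described above: the code segment sits at a known address even though the libraries, stack and heap are randomized.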

Before we started reverse engineering the process, we noticed that the disassembler had trouble distinguishing between MIPS and MIPS16 code sections (similar to the Arm and Thumb case in an ARM firmware). This was a good time to test if Thumbs Up, originally tested only on Intel and Arm binaries, also produces improved analysis in our MIPS binary. Luckily for us, it worked quite well: initially we had 2525 functions, and after the execution we had a cleaner binary with 3478 marked functions. Now we started reverse engineering our binary without a need to manually improve IDA Pro’s analysis.

Immediately after we started the reverse engineering phase, we saw something odd. For some reason, it looks like we expect our messages to arrive in a textual form?!

Figure 4: Command strings to look for in the incoming messages.

In Figure 4, we can see the list of strings we expect to find in the incoming message. Each string routes our message to a specific handler function, such as the function we named EI_zcl_main_handler. At this point, we checked the ZigBee specs again, as it made no sense. The protocol should be binary, and with a really low bandwidth, why does our program think it should receive long strings?

After reading the conclusions from Eyal R. and Colin once more, it suddenly became clear. The modem has an additional role that we initially ignored: it translates the binary messages to a textual representation, and then sends them through a USB-to-Serial interface. This way the main CPU reads the easy to handle textual messages from a serial device that is mapped as a file in the operating system.

Colin found evidence that the lightbulbs use the Atmel BitCloud SDK, which is now closed source and must be purchased from Atmel. Therefore, it makes sense to assume that the same software stack is also used as a “decoder” layer in the modem CPU in the bridge:

  1. An incoming message is parsed and verified by the BitCloud software stack.
  2. The parsed message is then converted into a textual representation.
  3. This textual message is sent to the main CPU for handling.

This way, the main CPU only needs to be familiar with logical aspects of the ZigBee stack, but doesn’t need to implement complicated decoding and parsing features that are already included in the stack that is shipped with the Atmel modem.

From a security perspective, this design choice has its pros and cons. As far as we are concerned, it has a massive implication. We only have the firmware for the ipbridge process, which we can also debug using a remote gdbserver we compiled and placed on the bridge’s file system. The firmware for the modem is encrypted, and it will not be easy to recreate the steps from the original research (a power analysis attack) to extract the encryption key and decrypt the modem’s firmware.

This means that we can only treat the modem as a black box that performs a lot of parsing, and maybe even holds a few state machines. We have a few hints from the partial code version that is found on GitHub (that is a few years old), but for all intents and purposes it is simply a black box that can block some of our attack attempts if they demand we send malformed messages.

Nothing about this research is going to be easy, and so, we just add this new obstacle to our list and continue on.

Looking for vulnerabilities – Round I

Now that we understood why the modem sends us textual messages into a serial device, we tracked down the flow of the messages between the different threads, and started looking for vulnerabilities in each of the different handlers. Our efforts focused on the ZCL handler, as it supports read/write operations on a wide variety of data type attributes:

  • E_ZCL_UINT32 (0x23) – 4 bytes integer.
  • E_ZCL_UINT16 (0x21) – 2 bytes integer.
  • E_ZCL_UINT8 (0x20) / E_ZCL_ENUM8 (0x30) – 1 byte integer / 1 byte enum value.
  • E_ZCL_BOOL (0x10) – 1 bit Boolean value (True/False).
  • E_ZCL_ARRAY (0x48) – variable length byte array, using a 1-byte length field.

As you can probably understand, handling variable length fields in an embedded device is a sure recipe for vulnerabilities. Figure 5 shows the assembly code that handles this case:

Figure 5: Assembly snippet for the vulnerable handling of array data types.

Note: Bear in mind that the MIPS architecture uses a delay slot, so on the call to malloc(), the value 0x2B is passed as an argument inside the delay slot in the instruction: li   $a0, 0x2B. This can be a bit confusing for anyone reading MIPS assembly for the first time.

What did we find? An attacker could send a malicious response for a READ_ATTRIBUTE message, containing a malformed byte array that is bigger than the fixed size of 43 bytes (0x2B). This triggers a controlled heap-based buffer overflow, without any byte restrictions.

Possible limitations to this potential vulnerability:

  • As ZCL is a relatively high layer in the ZigBee stack, we can only afford to send an array of up to 70 bytes. Otherwise, our message will be bigger than the network restrictions.
  • A state machine check might enforce that we only respond to an existing request.
  • A state machine check might enforce that we only respond with the correct requested data type.
  • The modem might drop our message if it violates its black box logic.
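The size arithmetic behind this overflow is simple enough to model. The sketch below is a toy simulation, not the firmware's code: it treats the heap as a flat byte array in which the 0x2B-byte allocation is followed directly by neighboring data, and copies an attacker-supplied array using its own 1-byte length field. The function names and the 128-byte heap size are hypothetical.

```python
FIXED_ALLOC = 0x2B        # size passed to malloc() in the old firmware (43 bytes)
MAX_ARRAY_OVER_AIR = 70   # practical array size limit mentioned above


def overflow_bytes(length_field: int) -> int:
    """Number of bytes written past the end of the fixed allocation."""
    if not 0 <= length_field <= 0xFF:
        raise ValueError("the length field is a single byte")
    return max(0, length_field - FIXED_ALLOC)


def corrupted_neighbor(payload: bytes, heap_size: int = 128) -> bytes:
    """Copy the array the way the vulnerable handler does and return
    whatever landed beyond the 0x2B-byte chunk."""
    length = payload[0]
    data = payload[1:1 + length]
    if len(data) > heap_size:
        raise ValueError("toy heap too small for this payload")
    heap = bytearray(heap_size)    # chunk occupies heap[0:0x2B]
    heap[0:len(data)] = data       # copy with the attacker-controlled length
    return bytes(heap[FIXED_ALLOC:])
```

With the 70-byte limit, `overflow_bytes(MAX_ARRAY_OVER_AIR)` gives 27, so an over-the-air attacker would control roughly 27 bytes beyond the buffer, with no restriction on their values.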

This is not exactly the easiest vulnerability to exploit, but it’s a serious vulnerability nevertheless.

In an instance of good timing, our serial cables finally arrived and we immediately started checking if we had indeed found a vulnerability. We compiled a gdbserver and placed it on the bridge’s file system, and now encountered a new obstacle: we don’t have a transmitter with which to send our attack. After another consult with Eyal R., we bought the evaluation board of the lightbulb’s CPU, exactly as his team did in their research.

Meanwhile, we found a hack that allowed us to verify the existence of this vulnerability even without transmitting a radio message over the air (hoping that the modem won’t block us later on). The ipbridge process supports a debug testing mode that is activated by connecting to two named pipes that the process listens on using a debug thread: /tmp/ipbridgeio_in and /tmp/ipbridgeio_out. While these debug capabilities aren’t really helpful, we patched the binary so that messages that arrive through these pipes are added to the message queue as if they arrived from the modem itself.

Using this small binary patch, we were able to create our own process that connects to the named pipes and sends (textual) messages aiming to hit the vulnerable code function. After some trial and error, and using our debugger, we were able to trigger the vulnerability and prove it exists. The only caveat is that the modem can still block it, and this requires us to transmit the attack over radio.

While waiting for our transmitter, our full Philips Hue starter kit arrived with a brand new 2.1 model bridge and 3 lightbulbs. This looked like the right time to extract the new firmware from the bridge, together with updating the 2.0 bridge to the latest firmware. After all, up until now we worked on firmware from 2016, and things might have changed in the meantime.

Sadly, things did indeed change.

The first thing we noticed about the new firmware is its size. For some reason, the ipbridge ELF file grew from 1221KB to 3227KB. Opening it in IDA Pro showed us the main difference: the binary was (accidentally?) shipped with debug symbols. This is great news that can really help us in our reverse engineering attempts. Figure 6 shows some of these symbols:

Figure 6: Function symbols of the new firmware.

Using this new discovery, we learned that our initial reverse engineering was relatively accurate, and the name of the vulnerable function turned out to be: SMARTLINK_UTILS_ReadAttributeValue.

When analyzing the vulnerable function in the new firmware version, we had an unpleasant surprise. The list of supported data types was updated, and now the vendor supports character strings (0x42) instead of byte arrays (0x48). Although strings are still variable in length, the allocation now changed to be more appropriate to null terminated strings:

  1. A 1-byte length field (denoted as L) is read from the incoming message.
  2. A buffer of size L + 1 is allocated.
  3. L data bytes are copied from the incoming message into the allocated buffer.

A fixed heap buffer is no longer used, and this change of supported data types just closed our vulnerability. Time to search for a new one.
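The three steps of the updated logic can be sketched as follows. This is an illustrative model rather than the vendor's code. The function name is hypothetical, and the truncation check is an addition of this sketch, since the text does not say how the firmware handles a short message.

```python
def read_string_attribute(message: bytes) -> bytearray:
    """Model of the patched parsing: size the buffer from the length field."""
    if not message:
        raise ValueError("missing length field")
    length = message[0]                       # step 1: read L
    if len(message) - 1 < length:
        raise ValueError("message shorter than its length field")
    buf = bytearray(length + 1)               # step 2: allocate L + 1 bytes
    buf[:length] = message[1:1 + length]      # step 3: copy L bytes
    return buf                                # last byte stays 0 (terminator)
```

Because the allocation is derived from the same length that drives the copy, even the largest value of L (255) fits in its 256-byte buffer.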

Looking for vulnerabilities – Round II

We put the ZCL module aside and eventually found our way to the ZDP module, more specifically, to the handler of incoming LQI (Link Quality Indicator) management responses. These messages are part of a module that is responsible for neighbor discovery. Periodically, the bridge queries the lightbulbs for their known neighbors in the ZigBee network. While the name suggests that the messages are focused on the quality of the radio transmission, the message structure is actually focused on the full set of network addresses for each neighbor.

The context for each neighbor, as seen in these messages:

  • 8 bytes – Extended address: Globally unique network address (similar to the Ethernet MAC address).
  • 2 bytes – Network address: Short network address that is locally unique in the current ZigBee network.
  • 2 bytes – PAN Id: Personal Area Network Identifier, the identifier of the local ZigBee network.
  • 1 byte – LQI: Link Quality Indicator in the range of 0-255.
  • 3 bytes – Misc: Other flags, adding up to 16 bytes per record.
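As a quick sanity check of the 16-byte total, the record can be described with a fixed-size structure. The field order and the little-endian byte order below are assumptions that simply mirror the list above. They are not taken from the firmware or from the on-air ZDP format, and the names are hypothetical.

```python
import struct
from typing import NamedTuple

# 8-byte extended address, 2-byte network address, 2-byte PAN Id,
# 1-byte LQI, 3 bytes of miscellaneous flags. "<" disables padding.
RECORD = struct.Struct("<8sHHB3s")


class Neighbor(NamedTuple):
    extended_address: bytes
    network_address: int
    pan_id: int
    lqi: int
    misc: bytes


def parse_neighbor(raw: bytes) -> Neighbor:
    """Decode one neighbor record of exactly RECORD.size bytes."""
    if len(raw) != RECORD.size:
        raise ValueError(f"expected {RECORD.size} bytes, got {len(raw)}")
    return Neighbor(*RECORD.unpack(raw))
```

`RECORD.size` evaluates to 16, which matches the per-record size quoted in the list.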

As both parties need to tell each other about a variable number of neighbors, which can include up to 0x41 supported records in the ipbridge global neighbor array, these messages include a fragmentation format. In each response, the lightbulb tells the bridge that it is currently answering with L records, from offset X to offset X + L - 1, out of possible S records.

As you may recall, the message sizes in the ZigBee stack are quite small, so using so many indices in each message, and sending multiple records of 16 bytes each, really limits the number of records that can be included in each message. As a result, the developers store the incoming records on the stack in an array that can hold up to 6 records. However, there are no checks in place to make sure that the incoming length field is indeed small enough, leading to a potential stack-based buffer overflow.
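The missing validation amounts to a couple of comparisons. The sketch below shows what such a check could look like, using the limits quoted in the text (6 records on the stack, 0x41 records in the global array, 16 bytes per record). It is a hypothetical reconstruction, not the vendor's fix, and the parameter names are invented for illustration.

```python
MAX_STACK_RECORDS = 6      # capacity of the on-stack array
MAX_TOTAL_RECORDS = 0x41   # capacity of the global neighbor array
RECORD_SIZE = 16           # bytes per neighbor record


def validate_fragment(total: int, start: int, count: int) -> bool:
    """True if a response carrying `count` records, starting at index
    `start` out of `total`, is safe to copy into the local array."""
    if not 0 <= total <= MAX_TOTAL_RECORDS:
        return False
    if not 0 <= count <= MAX_STACK_RECORDS:   # the check that was missing
        return False
    return start >= 0 and start + count <= total


def stack_bytes_written(count: int) -> int:
    """Bytes the handler writes when it trusts the incoming count."""
    return count * RECORD_SIZE
```

A count of 6 fills exactly the 96-byte array, while a count of 7 would already write 112 bytes, 16 past its end.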

You might wonder how we are planning to transmit such a “huge” message and overflow the buffer. Due to the physical limitation on the message sizes over radio, our only hope is to find a vulnerability in the modem, and then use this stack-based overflow to hop from the modem and into the main CPU. This means that even if we just found a vulnerability, it could only be exploited using an additional vulnerability in an additional CPU for which we don’t even have the firmware. Not exactly a great plan, but in the absence of anything else…

Before starting such a daring move, we once again used our hack to inject packets, and tried to trigger a controlled stack-based buffer overflow to check the exploitability of this new vulnerability. Unfortunately, the return address on the stack lies exactly in an offset that we don’t fully control when overflowing. Our overflow occurs by parsing incoming fields and placing them in a local struct. It turns out that we can only overflow the return address with the value 0x00000004.

Verdict: Not exploitable. At least this saved us the need to try and look for vulnerabilities in the modem.

Figure 7: Missing check in LQI message handling, together with the verdict – not exploitable.

Side note: The maximum number of records that is allowed in the BitCloud SDK is 2. The ZigBee protocol uses multiple indices in a fragmentation message that can only carry up to 2 records in each message. It’s not exactly efficient, to say the least.

Looking for vulnerabilities – Round III – CVE-2020-6007

Happily enough, 3 turned out to be our lucky number. After we finished auditing the code for all of the different message handlers, we had an intriguing question: When we send ZCL attributes, who handles them after the initial (no longer vulnerable) parsing?

While trying to answer this question, we found a new thread named applproc. This thread reads the structure that includes our parsed attribute, performs an unknown state-machine check, and if we are fortunate, delivers our message to the CONFIGURE_DEVICES_ReceivedAttribute function. Figure 8 shows the assembly of this function:

Figure 8: Assembly of function CONFIGURE_DEVICES_ReceivedAttribute.

For some unknown reason, an opcode is extracted from the incoming struct:

  • 0x0F – Probably means a string type, because the input is duplicated using strdup().
  • 0x10 – Probably means a byte array, because the input is duplicated using the same vulnerable code pattern we saw in the old firmware version.

When we went back to check how this structure is initialized, we saw this snippet:

Figure 9: Using the value 0x10 when handling an incoming string, thus creating a type mismatch.

It looks like the transition from supporting arrays to strings was done only halfway, as the string is marked by mistake as “array” using the constant 0x10 instead of 0x0F. This means that once again we have a heap-based buffer overflow vulnerability, and we were able to trigger it using our hack alongside a slight modification to our previous PoC.
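The type mismatch can be modeled as a producer and a consumer that disagree on a tag. In the toy model below, only the constants 0x0F, 0x10 and 0x2B come from the analysis above. The dictionary layout, the function names, and the assumption that the array path still copies into a 0x2B-byte buffer (as in the old firmware) are hypothetical.

```python
TYPE_STRING = 0x0F   # consumer duplicates the value with strdup()
TYPE_ARRAY = 0x10    # consumer copies the value into a fixed-size buffer
FIXED_ALLOC = 0x2B   # assumed size of that fixed buffer


def parse_string_attribute(message: bytes, tag: int = TYPE_ARRAY) -> dict:
    """Producer: parses a length-prefixed string correctly, but by default
    labels it with the array tag, mirroring the half-finished transition."""
    length = message[0]
    return {"tag": tag, "length": length, "value": message[1:1 + length]}


def received_attribute(attr: dict) -> int:
    """Consumer: return how many bytes would be written out of bounds."""
    if attr["tag"] == TYPE_STRING:
        return 0  # strdup() sizes the copy from the string itself
    if attr["tag"] == TYPE_ARRAY:
        return max(0, attr["length"] - FIXED_ALLOC)
    raise ValueError("unknown attribute tag")
```

A 60-byte string tagged as an array overflows by 17 bytes in this model. Passing `tag=TYPE_STRING` to the producer makes the same input harmless.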

Now that we have a vulnerability, one that still depends on an unknown state machine check we need to pass, it is a good time to unpack the newly arrived transmitter and try to trigger the vulnerability over the air. In the next chapters, we describe the exploitation process for this vulnerability, together with the ZigBee obstacles we discovered and overcame in the process.

Sniffing for clues

It is important to note that we specifically chose the ATMEGA256RFR2-XPRO evaluation board for multiple reasons:

  • If we manage to perform the entire exploit from the evaluation board, we prove it is feasible to perform it from a lightbulb, as they have the exact same hardware capabilities: same CPU, same antenna, etc.
  • We hope to salvage some C code from the previous research that was conducted by Eyal Ronen (@eyalr0) and Colin O’Flynn (@colinoflynn).

Surprisingly enough, the first point turned out to be a crucial one, but we discuss this part later on.

You might expect that when you buy an Atmel product that comes with a Visual Studio based IDE called “Atmel Studio”, it’s easy to create a sample ZigBee project that simply sniffs messages and prints them to the output/serial. Sadly, this wasn’t the case. After some Googling, we found that Atmel provides a series of useful YouTube tutorials, like this one, in which a man sailing on a boat (we’re not kidding) tells you how to use the extension manager and download a package that allows you to create sample Wireless projects. This is exactly what we initially looked for.

Now that we were able to sniff some messages, we paired a lightbulb with our bridge (a process called “commissioning”), and printed the messages to the serial output. At this point, we realized that while we now have some recorded messages, we don’t really have a proper way to parse them into a human-readable format. We tried a variety of open source Python scripts for ZigBee, but none of them were really useful. We did manage, however, to load the hex-dumped packets into Wireshark, using the following encapsulation type shown in Figure 10:

Figure 10: Wireshark encapsulation type for IEEE 802.15.4 (Zigbee) messages.

Important note: Wireshark fails to analyze the messages if they have an invalid FCS (Frame Check Sequence) field. When we transmit messages, this field is automatically calculated and added by the antenna. Therefore, we recommend that you drop this field from incoming messages, and pick in advance the encapsulation type that tells Wireshark that the FCS field is not present. This makes it easier to analyze dumps of incoming and outgoing messages.
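One way to follow this recommendation is to strip the FCS on the PC side and write the frames into a pcap file whose link type declares that no FCS is present. The sketch below is an assumption-laden illustration rather than the tooling used in this research: it uses link type 230 (IEEE 802.15.4 without FCS) from the pcap link-type registry, the classic pcap file layout, and the 16-bit ITU-T CRC that IEEE 802.15.4 uses for its FCS. The function names and the dummy timestamps are hypothetical.

```python
import struct

LINKTYPE_IEEE802_15_4_NOFCS = 230


def crc16_fcs(data: bytes) -> int:
    """16-bit ITU-T CRC, bit-reflected, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x8408 if crc & 1 else crc >> 1
    return crc


def fcs_is_valid(frame: bytes) -> bool:
    """Check the trailing FCS, which is transmitted low byte first."""
    return len(frame) >= 2 and crc16_fcs(frame[:-2]) == int.from_bytes(frame[-2:], "little")


def strip_fcs(frame: bytes) -> bytes:
    """Drop the trailing 2-byte FCS from a sniffed frame."""
    if len(frame) < 2:
        raise ValueError("frame too short to carry an FCS")
    return frame[:-2]


def pcap_bytes(frames) -> bytes:
    """Serialize FCS-less frames as a classic little-endian pcap file."""
    # Global header: magic, version 2.4, timezone, sigfigs, snaplen, link type.
    out = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 0xFFFF,
                      LINKTYPE_IEEE802_15_4_NOFCS)
    for index, frame in enumerate(frames):
        # Record header: seconds, microseconds, captured length, original length.
        out += struct.pack("<IIII", index, 0, len(frame), len(frame)) + frame
    return out
```

Writing the result of `pcap_bytes(strip_fcs(f) for f in sniffed)` to a `.pcap` file gives Wireshark a capture that carries no FCS bytes and a link type that says so.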

Even a short glance at the dumped conversation taught us that a few things are missing:

  • Most of the messages are encrypted, and we don’t have the key.
  • Some of the commissioning messages are missing, as the plain-text conversation looks broken.

As we mentioned earlier, the decision to use our specific evaluation board proved to be crucial. The protocol transmits messages so quickly that the baud rate of the serial interface causes critical delays. In short, in the time it takes us to print the messages / send them to our PC, we miss important messages from the ongoing conversation. We have to implement the entire exploit on the evaluation board (in C) if we want to have even the slightest chance of keeping up with the fast pace of the messages in the ZigBee protocol.
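A back-of-the-envelope calculation shows why the serial link becomes the bottleneck. The radio figures below are those of the IEEE 802.15.4 2.4 GHz PHY (250 kbit/s, 6 bytes of preamble, start delimiter and length). The serial settings are assumptions chosen for illustration (115200 baud, 8N1, each frame printed as hex plus a newline), since the text does not state the actual configuration.

```python
RADIO_BITRATE = 250_000     # bits per second on the 2.4 GHz PHY
PHY_OVERHEAD = 6            # preamble (4) + start delimiter (1) + length (1)
SERIAL_BAUD = 115_200       # assumption: a common UART speed
BITS_PER_SERIAL_CHAR = 10   # assumption: 8N1 framing (start + 8 data + stop)


def airtime_ms(frame_len: int) -> float:
    """Time a frame of `frame_len` bytes occupies the radio channel."""
    return (frame_len + PHY_OVERHEAD) * 8 / RADIO_BITRATE * 1000


def serial_hex_ms(frame_len: int) -> float:
    """Time to print the same frame as hex digits plus a newline."""
    chars = 2 * frame_len + 1
    return chars * BITS_PER_SERIAL_CHAR / SERIAL_BAUD * 1000
```

Under these assumptions a full 127-byte frame takes about 4.3 ms on the air but about 22 ms to print, roughly five times longer, so back-to-back frames arrive faster than they can be forwarded one at a time.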

Meanwhile, we buffered the messages on the board itself, and sent them to the PC only when our buffer was full. This enabled us to record most of the messages. However, we still missed a few when both the lightbulb and the bridge transmitted together during short periods of time.

Opening the crypto layer

Wireshark supports the option to decrypt the ZigBee messages and analyze their decrypted payload, but you must supply it with the proper key. This was a good time to read about the protocol and learn how its crypto design works.

In short, each device uses two important keys:

  • Transport Key – A global (broadcast) key that is shared between all of the ZigBee devices.
  • Network Key – A network specific key that is distributed by the bridge to the now joining lightbulb (during the “commissioning” phase).

The vast majority of messages should only be encrypted with the network key, and the transport key is only used when distributing the network key to a lightbulb during the commissioning phase. Which brings us to the immediate problem: we need the transport key, otherwise Wireshark won’t tell us what our network key is.

Figure 11 shows the sample ZigBee .pcap file from Wireshark’s website, and the Transport Key message is highlighted:

Figure 11: Sample ZigBee recording, the Transport Key is encrypted and shown as: APS: Command

We can see that since we don’t have the transport key, we can’t decrypt the Transport Key message, and it is merely shown as a generic APS command.

Although we found multiple keys when researching the topic, none of them worked. It seems we were not the first to tackle this issue, as eventually we reached this blog post in which the author details the solution to the problem. It turns out that the “regular” keys are used in a “touchlink commissioning”, but our “classic commissioning” uses a different secret key. Fortunately, both of the keys are included in the article, and they indeed worked. This time we managed to successfully decrypt the message inside Wireshark. Figure 12 shows the decrypted message:

Figure 12: A decrypted Transport Key message, containing the Network Key.

Note: We deliberately chose to include the actual network key in this image. Later on we also include a link to a full .pcap recording of the entire commissioning conversation with our model bridge.

When implementing the crypto layers on our evaluation board, we relied on the excellent implementation from Wireshark’s ZigBee dissector, found on GitHub.

Naive Attack Attempt

After we found the network key with which the lightbulbs and the bridge encrypt all of their messages, we can try to craft our own hostile ZCL message and check if it triggers our breakpoint in the vulnerable function. After a few rounds of trial and error, we had some good news and bad news:

  • Good news: There are no ZCL state machine checks in place. The bridge parses response messages to requests that it didn’t even issue.
  • Bad news: The unknown state machine check that we found near the vulnerable function is routing our message to a different function.

The check is shown in Figure 13:

Figure 13: Some state machine check that blocks our attack.

Initially, it looked like we might need a minimum number of 2 or 3 lightbulbs in the network, but this didn’t work either. After diving back into the code, we learned that the function checks if the lightbulb that sent the message is currently undergoing a commissioning process.

Conclusion #1: The vulnerable function is only reachable when commissioning a new lightbulb into the ZigBee network. Legitimate participants in the ZigBee network can’t trigger the vulnerability we wish to exploit.

Classic Commissioning

“Classic Commissioning” is the process of pairing (commissioning) a new lightbulb into our ZigBee network using the standard mobile app. In our case, we used the Philips Hue app from the Android Play Store.

Surprisingly, while there are many documents and specs that describe the messages in the ZigBee protocols, we failed to find a proper document that describes the flow of messages during the commissioning process. Therefore, we merged two approaches, and hoped that eventually we would implement enough messages and convince our mobile app that the bridge really discovered a “new” lightbulb. The approaches are:

  • Record as many messages as we can from the commissioning of an ordinary lightbulb.
  • Gradually implement the different messages and transmit them to the bridge. Each time the bridge answers back with a new request, we implement the matching response for it and repeat the same experiment.

Conclusion #2: The bridge won’t accept new lightbulbs into the network unless the user actively orders it to search for new lightbulbs. This is a good design choice that significantly lowers the attack surface on the bridge itself. In our attack scenario, we have to somehow trick the user into pressing this button in their app.

This also means that before each experiment, we had to press the button in the app (giving us a grace period of 1-2 minutes), and only then execute our program from the evaluation board. This was the case when we were trying to learn about the messages in the commissioning phase, and also when testing the exploit. Not exactly a smooth automatic procedure, but it worked eventually.

As we promised earlier, here is a link to a full .pcap recording of the classic commissioning with our model bridge, up to the point where the mobile app notifies us about a new lightbulb. The messages are stripped of their FCS field, and the pcap doesn’t contain the IEEE 802.15.4 Ack messages, as they are sent as an acknowledgment after almost every message.

Implementation Note: There are multiple strict timing restrictions in the ZigBee protocol and in the bridge’s modem, making the entire conversation extremely unreliable if not timed correctly. This means that we must acknowledge incoming messages very quickly, a requirement that our custom implementation of the ZigBee network stack could not meet. Therefore, we configured our evaluation board to automatically acknowledge incoming messages in its MAC layer. This change had a significant downside: it means that we can no longer sniff messages in promiscuous mode.

Figure 14 shows our crafted new lightbulb, as it’s shown in the app if the user requests full details.

Figure 14: Our crafted lightbulb, as seen in the Philips Hue mobile app.

As you can see, there are multiple controllable string fields that are exchanged during the commissioning phase. We chose to label our new lightbulb as a brand new Check Point Research lightbulb, model “CPR123”.

The commissioning phase can be divided into 4 main parts:

  1. Association: The new lightbulb presents itself, and is associated with a short network address.
  2. Acceptance: The new lightbulb receives the network key, and announces itself using a Device Announce message.
  3. Bureaucracy: The bridge queries the lightbulb for multiple descriptors.
  4. ZCL: The bridge issues multiple ZCL (ZigBee Cluster Library) READ_ATTRIBUTE requests to learn about the specs of the lightbulb.

Only during the ZCL Phase can we start sending our malicious ZCL messages, in an attempt to trigger the heap-based buffer overflow that we found earlier. We can send malicious response messages regardless of the actual requests that are issued by the bridge, but we can only start sending them after we reach this phase in the commissioning process.
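As a rough illustration of this ordering constraint, the four phases can be modelled as an ordered enumeration. This is only a sketch: the names Phase and can_send_malicious_zcl are hypothetical and do not come from the bridge's firmware or from our tooling.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The four commissioning phases, in the order listed above (hypothetical names)."""
    ASSOCIATION = 1
    ACCEPTANCE = 2
    BUREAUCRACY = 3
    ZCL = 4


def can_send_malicious_zcl(current: Phase) -> bool:
    # The vulnerable parser only becomes reachable once the ZCL phase has started.
    return current >= Phase.ZCL


if __name__ == "__main__":
    for phase in Phase:
        print(phase.name, can_send_malicious_zcl(phase))
```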

Attacking the heap

We decided to tackle our problems one at a time. Our first goal was to succeed in exploiting the heap-based vulnerability and jump to an arbitrary memory address, and later on discover where to jump. This turned out to be the wrong decision, as the heap varied a lot based on the messages with which we placed the shellcode in the target’s memory.

The first thing to do when exploiting a heap-based buffer overflow is to check which heap implementation is used by the target. In our case, the target uses uClibc, which stands for “micro LibC”. The exact version was clearly listed in the library’s file name: libuClibc-1.0.14.so. The library supports a few different heap implementations, and we easily spotted in the binary the use of the “malloc-standard” implementation, which is based on dlmalloc.

For a small LibC implementation whose prime target consists of products with constrained memory and CPU resources, the implementation is quite straightforward:

  • Fast-Bins store small-sized free buffers using a singly-linked list.
  • A doubly-linked list stores the rest of the free buffers.
  • On multiple occasions, consolidation is done to optimize/clean the heap.

For a “standard” dlmalloc implementation, this is the meta-data used by this heap implementation:

Figure 15: The malloc_chunk structure used in our heap implementation.

Notes:

  • When a buffer is allocated and used, the first two fields are stored before the user’s data buffer.
  • When the buffer is freed and placed in a Fast-Bin, the third field is also used, and points to the next node in the Fast-Bin’s linked list. This field is located at the first 4 bytes of the user’s buffer.
  • When the buffer is freed and isn’t placed in a Fast-Bin, both the third and fourth fields are used as part of the doubly-linked list. These fields are located at the first 8 bytes of the user’s buffer.
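The notes above can be checked against a small model of the chunk header. The sketch below assumes a 32-bit target with 4-byte fields and uses the conventional dlmalloc field names (prev_size, size, fd, bk). It is a Python stand-in for the C structure, not a copy of the uClibc source.

```python
import ctypes


class MallocChunk(ctypes.Structure):
    """Model of a dlmalloc-style chunk header on a 32-bit target (assumed layout)."""
    _fields_ = [
        ("prev_size", ctypes.c_uint32),  # size of the previous chunk
        ("size", ctypes.c_uint32),       # size of this chunk, plus flag bits
        ("fd", ctypes.c_uint32),         # next free chunk (modelled as a 32-bit pointer)
        ("bk", ctypes.c_uint32),         # previous free chunk (doubly-linked bins only)
    ]


# The user's data starts right after the first two fields, so fd and bk
# overlap the first 8 bytes of the user's buffer once the chunk is freed.
USER_DATA_OFFSET = MallocChunk.fd.offset

if __name__ == "__main__":
    for name, _ in MallocChunk._fields_:
        print(name, getattr(MallocChunk, name).offset)
```

With this layout, the Fast-Bin pointer (fd) sits at the very start of the freed user buffer, which is why an overflow from the preceding chunk can reach it.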

Picking our target inside the heap

Doubly-linked lists sometimes offer a great exploit primitive, as during the list unlinking a corrupt node can trigger a Write-What-Where operation. However, we are no longer in the early 2000s, and this primitive isn’t going to work for us in this popular heap implementation. Instead, the developers deployed a protection mechanism known as “Safe Unlinking” which verifies the “forward” and “backward” pointers before using them.

Figure 16: The unlink macro from uClibc, using a “Safe Unlinking” approach.
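To make the idea behind this check concrete, here is a minimal Python model of a “Safe Unlinking” test on a circular doubly-linked list. The class and function names are hypothetical, and the real code is a C macro, so treat this as a sketch of the technique rather than the library's implementation.

```python
class Chunk:
    """A free chunk in a circular doubly-linked list (toy model)."""

    def __init__(self, name):
        self.name = name
        self.fd = self  # forward pointer
        self.bk = self  # backward pointer


def insert_after(pos, node):
    node.fd = pos.fd
    node.bk = pos
    pos.fd.bk = node
    pos.fd = node


def safe_unlink(p):
    fd, bk = p.fd, p.bk
    # Verify that both neighbours still point back at p before writing through them.
    if fd.bk is not p or bk.fd is not p:
        raise RuntimeError("corrupted doubly-linked list")
    fd.bk = bk
    bk.fd = fd
```

If an overflow replaces the forward or backward pointer of a free chunk with an attacker-chosen address, the neighbour check fails before any write happens, which is what removes the classic Write-What-Where primitive.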

Due to this security mitigation, we decided instead to attack the Fast-Bins. These bins consist of singly-linked lists, meaning that they can’t be properly verified like the previous doubly-linked lists.

The Fast-Bins are an array of various sized “bins”, each holding a singly-linked list of chunks of up to a given size. The minimal bin size contains buffers of up to 0x10 bytes, the next holds buffers of 0x11 to 0x18 bytes, and so on. During our study, we found an interesting bug in the implementation of the free() method:

Figure 17: Implementation of the fastbin_index() macro.

Relying on the fact that the smallest allocation size should be 0x10, the fastbin_index() macro divides the size by 8, subtracts 2 from it, and uses the result as the index into the Fast-Bin array. If we can corrupt the metadata record of a given freed chunk, we can change this index to be one of the invalid values: -1 or -2.

Figure 18: Surroundings of the fastbins array in the global malloc_state.

Using the invalid value of -1 stores our freed buffer in the max_fast field, which is responsible for the configurable maximum size for a fast bin. Storing a pointer in this field will probably wreak havoc, but what about the invalid -2 index?

Using a debugger, we saw that nothing is stored before the malloc_state global struct, meaning that storing a pointer at fastbins[-2] won’t ruin anything important. In addition, malloc() won’t think of checking this invalid Fast-Bin for any allocation to be returned to the user. For any practical use, we just created the /dev/null bin, giving us a primitive to leak memory from the heap, a primitive that can help us shape it to our desired state.
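The index arithmetic, and where the two invalid indices land, can be sketched as follows. The constants are assumptions for illustration: 4-byte pointers, max_fast stored directly before the Fast-Bin array at the start of malloc_state, and the two low bits of the size field masked off as flags, as is common in dlmalloc-derived allocators.

```python
# Toy model of the Fast-Bin indexing described in the text.
# ASSUMED values: pointer size, field offsets and the flag mask.
POINTER_SIZE = 4
FLAG_BITS = 0x3
MAX_FAST_OFFSET = 0                               # assumed first field of malloc_state
FASTBINS_OFFSET = MAX_FAST_OFFSET + POINTER_SIZE  # the array follows max_fast


def chunksize(size_field):
    """Strip the flag bits from a raw size field."""
    return size_field & ~FLAG_BITS


def fastbin_index(size):
    """Divide the size by 8 and subtract 2, as the macro does."""
    return (size >> 3) - 2


def slot_offset(index):
    """Byte offset, relative to the start of malloc_state, of fastbins[index]."""
    return FASTBINS_OFFSET + index * POINTER_SIZE


if __name__ == "__main__":
    for raw in (0x10, 0x18, 0x08, 0x01):
        idx = fastbin_index(chunksize(raw))
        print(f"size field {raw:#04x}: index {idx:+d}, slot offset {slot_offset(idx):+d}")
```

Under these assumptions, index -1 lands on offset 0 (the max_fast field), and index -2 lands 4 bytes before the structure.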

Overflow plan

Our vulnerability gives us a controllable heap-based buffer overflow from a buffer of 0x2B bytes and up to roughly 70 bytes. Due to basic alignments in the heap, we most probably get a buffer of size 0x30 (we get a larger buffer only if we run out of suitable buffers). In addition, there is some weird quirk in the heap’s implementation:

  • Bytes 0x00-0x04: The size field from our malloc_chunk.
  • Bytes 0x04-0x2C: The user’s data buffer (which is 4 bytes shorter than we need).
  • Bytes 0x2C-0x30: The “missing” 4 bytes, also acting as the prev_size field of the malloc_chunk that is positioned after us in memory.

This bizarre implementation probably saved someone 4 bytes per malloc chunk, but it sure didn’t make the code easier to read or debug.
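The rounding from a 0x2B-byte request to a 0x30-byte chunk follows from the usual dlmalloc size computation. The sketch below assumes the typical 32-bit constants (4-byte size field, 8-byte alignment, 16-byte minimum chunk). These values are assumptions rather than something we quote from the uClibc source.

```python
# ASSUMED 32-bit dlmalloc-style constants.
SIZE_SZ = 4            # one size field of overhead per chunk in use
MALLOC_ALIGNMENT = 8   # chunk sizes are multiples of 8
MIN_CHUNK_SIZE = 0x10  # smallest chunk the heap hands out


def request2size(request):
    """Round a user request up to the chunk size the heap would pick."""
    mask = MALLOC_ALIGNMENT - 1
    padded = (request + SIZE_SZ + mask) & ~mask
    return max(padded, MIN_CHUNK_SIZE)


if __name__ == "__main__":
    for request in (0x01, 0x2B, 0x2C, 0x2D):
        print(hex(request), "->", hex(request2size(request)))
```

Only one size field is counted as overhead because the last 4 bytes of the user's data spill into the prev_size field of the next chunk, which is the quirk described above.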

Having all of these details in mind, our master plan is to overflow an adjacent free buffer that is located in a Fast-Bin. Figure 19 shows how the buffers look before our overflow:

Figure 19: Our controlled buffer (in blue) placed before a freed buffer (in purple).

Figure 20 shows the same two buffers after our overflow:

Figure 20: Our overflow modified the size and ptr fields of the freed buffer (shown in red).

Using our overflow, we plan to modify the size of the adjacent buffer to 1. As the size is always divisible by 4, the two least significant bits store the prev_inuse and is_mmapped flag bits. In practice, we tell the heap that our buffer is still in use, and that the size of the adjacent (free) buffer is zero.

We also modify what we hope is the singly-linked pointer of a Fast-Bin record. By changing this pointer to our own arbitrary address, we can trick the heap into thinking that a new freed chunk is now stored there. By triggering a sequence of allocations of the size which matches that of the relevant Fast-Bin, we can gain a Malloc-Where primitive, with which we plan to gain our code execution.
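A toy simulation shows why corrupting a single Fast-Bin pointer is enough. The model below is deliberately simplified: memory is a flat bytearray, the bin is a last-in-first-out list whose “next” pointer lives in the first 4 bytes of each free chunk, and all addresses (including TARGET, which stands in for a GOT slot) are made up for the example.

```python
import struct


class ToyFastBin:
    """Singly-linked LIFO free list stored inside a flat memory buffer (toy model)."""

    def __init__(self, memory):
        self.mem = memory
        self.head = 0  # address 0 means "empty"

    def free(self, addr):
        # Store the old head in the first 4 bytes of the freed chunk.
        struct.pack_into("<I", self.mem, addr, self.head)
        self.head = addr

    def malloc(self):
        addr = self.head
        if addr == 0:
            return None
        # Trust whatever "next" pointer is stored in the chunk: no validation.
        (self.head,) = struct.unpack_from("<I", self.mem, addr)
        return addr


def demo():
    memory = bytearray(0x200)
    fast_bin = ToyFastBin(memory)
    victim = 0x40    # hypothetical address of the freed chunk next to our buffer
    target = 0x100   # hypothetical address we want malloc() to return
    fast_bin.free(victim)
    # The overflow: overwrite the victim's "next" pointer with our target.
    struct.pack_into("<I", memory, victim, target)
    first = fast_bin.malloc()   # hands back the victim chunk
    second = fast_bin.malloc()  # hands back the attacker-chosen address
    return first, second


if __name__ == "__main__":
    print([hex(address) for address in demo()])
```

The second allocation returns the address planted by the overflow, which is the Malloc-Where primitive in miniature: whatever the program copies into that “buffer” is written to a location of our choosing.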

Here is a short description of the different scenarios we might encounter during our overflow:

  • The buffer after us is free and placed in a Fast-Bin. Success, as we corrupted the Fast-Bin pointer and can use it to gain the desired Malloc-Where primitive.
  • The buffer after us is in use, and will be free()ed. Partial Success, as the free() operation will use our corrupt size and store it in the /dev/null bin.
  • The buffer after us is in use, and will never be free()ed. Not so bad. In this case, we hope we didn’t ruin anything, as we only modified 4 bytes in that buffer.
  • The buffer after us is free and we corrupted a doubly-linked list. Failure, and let’s hope that the heap won’t check this buffer again, as it triggers a crash.

We can only really lose in 1 of the 4 scenarios, and in the rest of them we either directly win, or advance towards winning. Let’s hope that the odds are in our favor, and try to overflow the least number of times needed for a successful exploit.

Special Note: After we finished this research, we devised a security mitigation called “Safe Linking” that protects the singly-linked lists in the heap from exploits like the one we have just described. This feature is already integrated into the latest versions of uClibc-NG and glibc. For more info, here is our blog post on “Safe Linking”.

Heap Shaping

The most important factor in shaping the heap into the form shown above is that the main CPU is quite weak. If we send many messages, and we send them fast enough, we actually starve some of the threads in the target program. This means that during our attack, the threads in our data flow are practically the only threads that are scheduled for execution. This important behavior drastically improves our success rate, as it reduces the noise in the heap to a minimal level.

Equipped with this important discovery, and knowing that we overflow the heap from an allocation of malloc-size 0x30 bytes, we devised a simple plan:

  1. Send multiple ZCL strings that are allocated to sizes of 0x28 and 0x30.
  2. Send a (very) few overflowing ZCL strings, aiming the hijacked Fast-Bin pointer to the Global Offset Table (GOT).
  3. Send an additional burst of messages of sizes 0x30, hoping to get the Malloc-Where primitive.

The first phase is the slowest one, as we want the buffers to gradually be freed before we start our overflow. Again, we aim to overflow directly into a freed buffer.

In the second phase, we hope to modify a Fast-Bin pointer to point directly at the address of the pointer to free() in the GOT. This way, the third phase sends messages and one of them is stored in the GOT, as the heap mistakenly thinks it is a free heap buffer. This Malloc-Where primitive now turns into a fully controlled network packet that is written to an arbitrary memory address, a very strong exploit primitive. And the trigger itself is immediate: a call to free() on one of our messages jumps to execute our shellcode.

Storing our Shellcode

As the majority of the allocations use the heap (which is randomized by the Linux operating system), the task of locating a controllable address in which we can store our shellcode turned out to be a relatively complex one. In addition, as the modem sends short textual messages to the main CPU, we don’t have any global buffer that can store a long binary content of our choosing.

Eventually, we came to the conclusion that we can’t be picky and we must use the only global array that we’ve seen that is large enough: the array in which the bridge stores the incoming (LQI) neighbor records. This array has its pros and cons:

Pros:

  • The global array is located in a fixed known memory address.
  • The array (like all other global writable variables) has memory permissions of RWX.
  • The array is relatively big: it can hold up to 0x41 (65) records of 0x10 (16) bytes.

Cons:

  • We don’t fully control the entire record of 16 bytes.

Later on, we also learned that we can’t even use the entire capacity of 0x41 records. However, when you don’t have a lot of options, you can’t afford to be picky.

The restrictions on each neighbor record:

  • Bytes 0x00-0x08: Extended network address – Fully controlled.
  • Bytes 0x08-0x0A: Short network address – Fully controlled.
  • Bytes 0x0A-0x10: Misc fields – Uncontrolled.

On top of that, we don’t really have 10 adjacent controlled bytes, as the bridge checks that each record is unique. Each “extended network address” must be unique, which is easy due to its size of 8 bytes. Each “short network address” must also be unique, which is a totally different story, as we must be really creative to work around this restriction and make use of as many bytes as we can.
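The record layout and the uniqueness rule can be sketched as below. This assumes that the 8-byte extended address is immediately followed by the 2-byte short address, and it models the uncontrolled fields as zero bytes. The function names are hypothetical, and the duplicate check only mirrors the behaviour described above, not the bridge's actual code.

```python
RECORD_SIZE = 0x10     # each record is 16 bytes
EXT_ADDR_SIZE = 8      # extended network address
SHORT_ADDR_SIZE = 2    # short network address


def make_record(ext_addr: bytes, short_addr: bytes) -> bytes:
    """Build one 16-byte record; the last 6 bytes are outside our control."""
    if len(ext_addr) != EXT_ADDR_SIZE or len(short_addr) != SHORT_ADDR_SIZE:
        raise ValueError("bad address length")
    uncontrolled = bytes(RECORD_SIZE - EXT_ADDR_SIZE - SHORT_ADDR_SIZE)
    return ext_addr + short_addr + uncontrolled


def rejected_indices(addresses):
    """Indices of (ext, short) pairs that repeat an earlier extended or short address."""
    seen_ext, seen_short, rejected = set(), set(), []
    for index, (ext, short) in enumerate(addresses):
        if ext in seen_ext or short in seen_short:
            rejected.append(index)
            continue
        seen_ext.add(ext)
        seen_short.add(short)
    return rejected


if __name__ == "__main__":
    # Two records with different first 8 bytes but the same final 2 bytes.
    pairs = [(bytes(range(8)), b"\x12\x34"), (bytes(range(8, 16)), b"\x12\x34")]
    print(rejected_indices(pairs))
```

Eight bytes leave plenty of room for distinct values, but two bytes that must differ in every record are much harder to spend on repeated code patterns.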

The proper way of delivering our “neighbor records” to the bridge is through the LQI (Link Quality Indicator) management messages. However, this time both the modem and the main CPU keep track of a proper state machine, and we can only send our messages as proper answers to requests that originate from the bridge itself. Unfortunately, the bridge only issues these requests after we finish the ZCL Phase, meaning that we can only prepare the shellcode in the target’s memory after the opportunity window for the exploitation is closed.

At this point, we examined the content of the memory array and saw that our network addresses are also stored here, although we’ve yet to send any LQI message. Further examination revealed that the DEVICE_ANNOUNCE message we transmit during the Acceptance Phase also adds a single record to the neighbor array. This effectively means that it’s an address book array, and not a “neighbor array.”

Figure 21: Transmitting dummy DEVICE_ANNOUNCE messages that will later on contain our shellcode.

This is where things started to get messy. For each new address, the bridge sends a matching Route Request in an attempt to learn how to reach that new ZigBee node. These transmissions caused the bridge to be quite unstable, and affected the already shaky timeouts of the rest of the protocol’s state machines. Our solution to this new problem was to use multiple lightbulbs:

  1. A legitimate lightbulb that appears in the user’s mobile app, and later on exploits a backdoor we plan on installing.
  2. A fake lightbulb that advertises multiple “lightbulbs” and in practice places our shellcode in the global memory array of the target, as seen in Figure 21.
  3. An additional fake lightbulb that reaches the ZCL Phase and exploits the vulnerability now that the shellcode is already in memory.

As only the first lightbulb can successfully complete the commissioning phase, the user has no clue that the bridge saw additional phantom lightbulbs.

Ideal Shellcode Design

If we use MIPS16 assembly instructions, most of our instructions cost us 2 bytes each, and the more complex instructions cost us 4 bytes each. Ideally, we can use the first 8 bytes to perform a few assembly instructions, and use the last 2 bytes to jump ahead into the next record. This is where the uniqueness restrictions hit us hard. Most of the time we jump ahead 6 bytes (to the next record), which means that the jump/branch instruction is the same each time and violates the restriction. However, we can vary the jump instructions, for example by using conditional jumps.

The plan for our shellcode was to modify the original ELF and create a backdoor. Our heap modifications will most probably cause the process to be unstable, to say the least. If we modify the ELF itself, then after the process crashes, a software daemon (watchdog) restarts it, and this time it contains our embedded backdoor.

This plan was good on paper, but both the path to the ELF file and the backdoor itself were too big for our limitation of up to 10 consecutive controlled bytes in each record. The idea we came up with was a simple decoder loop:

  1. The first records run in a loop and copy the controlled bytes from the rest of the records, and arrange them in a consecutive memory buffer.
  2. The rest of the records are the actual payload for our shellcode.
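The job of the decoder loop can be modelled in a few lines of Python. The real loop was written in MIPS16 assembly and is not shown here. The sketch only illustrates the gather step, assuming 16-byte records whose first 10 bytes are controlled, with hypothetical names and a 0xEE filler standing in for the uncontrolled bytes.

```python
RECORD_SIZE = 0x10   # size of one record in the global array
CONTROLLED = 10      # bytes per record that we control (8 + 2)


def gather_payload(table: bytes, first_record: int, count: int) -> bytes:
    """Concatenate the controlled bytes of `count` records into one buffer."""
    out = bytearray()
    for index in range(first_record, first_record + count):
        start = index * RECORD_SIZE
        out += table[start:start + CONTROLLED]
    return bytes(out)


def scatter_payload(payload: bytes) -> bytes:
    """Inverse helper for the demo: spread a payload over records with filler bytes."""
    table = bytearray()
    for offset in range(0, len(payload), CONTROLLED):
        chunk = payload[offset:offset + CONTROLLED].ljust(CONTROLLED, b"\x00")
        table += chunk + b"\xEE" * (RECORD_SIZE - CONTROLLED)
    return bytes(table)


if __name__ == "__main__":
    payload = bytes(range(30))
    table = scatter_payload(payload)
    print(gather_payload(table, 0, 3) == payload)
```

In the real setting, the payload records would still have to respect the uniqueness rules on both address fields, so not every byte pattern can be stored this way.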

While the shellcode worked OK in our dummy environment, it encountered a few obstacles when we tried it on our real target.

First, the shellcode was expensive: it cost us 0x19 records; we originally sent only 0x10 records when testing the exploit for the heap-based vulnerability. This mere addition of only 9 records turned out to be too much: the bridge was too unstable and our third lightbulb failed to reach the ZCL Phase.

After a lot of calculations, we managed to squeeze our nice configurable shellcode into 0x12 records of a not-so-readable shellcode. We successfully bypassed this size limitation, and managed to start debugging our shellcode on the real target (using our remote gdbserver).

This is where we found the flaws in our initial plan. A decoder loop on a MIPS architecture mandates that we call sleep() so that we won’t have any cache issues. Otherwise, our re-arranged records won’t propagate (be flushed) to the processor’s Instruction Cache (I-Cache), and the processor effectively executes some random garbage instead of our full shellcode. This sleep meant that we pretty much destroyed the target’s heap, and while we had our beauty sleep other threads were left to deal with our mess and crashed.

We couldn’t afford to enlarge our shellcode in our attempts to restore the program’s flow and avoid crashes, and it turns out that the ELF file wasn’t even writable during execution, so we had to devise a new plan for our shellcode.

Bold Shellcode Design

If we already need to restore the execution flow so that the program won’t crash during our sleep(), we might as well fully restore it and install a backdoor in our own memory address space. This way we don’t have to write to any file, and the lack of a file path may remove the original need for an expensive decoder loop.

We went back to the drawing board, and after a few days managed to write a new shellcode that performs the following set of tasks (in order):

  1. Restore the execution flow: Stabilize the heap, restore the GOT, etc.
  2. Silence the watchdog: Make sure it won’t notice that we sent too many messages during our exploit (or at least make sure no one hears it).
  3. Install a backdoor: mprotect() a specific memory page to RWX permissions, and modify the needed bytes to incorporate our backdoor in the right place.

The second point was quite interesting, as it turned out that simply sending too many messages at a fast pace caused some threads to starve. When we finished, the watchdog saw that those threads were unresponsive, and exited the program alongside a nice syslog message that was sent to the vendor. This is probably the proper time to apologize to the vendors who might now think that something is wrong with one of their products, as it consistently sent them dozens of syslog reports.

Eventually, after some debugging, we had a working shellcode of 0x10 records. Figure 22 shows the memory layout of our shellcode, as is shown in IDA:

Figure 22: Memory layout of the final shellcode, as seen in IDA.

As you can see, the initial records hold the code to be executed, and the last 3 records store configuration variables including the data for the installed backdoor. Each code record executes a few assembly instructions and jumps ahead to the next record, until we finish all of our tasks and return to the original execution flow of the program.

Our Backdoor

We are not going to dive too deeply into the technical aspects of our backdoor, as we are not releasing a fully weaponized exploit to the public. What we can share is that our backdoor gave us a Write-What-Where primitive using a specially crafted message that we can now send to the target bridge from our “legitimate” lightbulb. We used this stable write primitive to write Scout’s loader to an RWX memory cave, and then used the fact that the code is still writable to redirect the execution to our new shellcode.

Scout’s loader simply connected back, over TCP, to our servers, received an executable to be dropped and deployed on the bridge, and executed it. In Figure 23 we can see the dropped /tmp/exploit process that executes the next stage of our attack.

Figure 23: Process list from the bridge, showing our malware is executed as root.

Using our brand new MIPS target, we were able to extend Scout to support the MIPS architecture, and it worked like a charm in our test case.

Combining the exploit parts

In our attack scenario, we want to take control of the bridge from the ZigBee network, and use it as a leverage point to attack additional computers in the IP network. But first, our vulnerability mandates that we trick the user into searching for new lightbulbs, which is not exactly an easy step. Using the attack primitives from the original research, we devised the following plan:

  1. Use the touchlink commissioning (used in the original research) and steal a lightbulb from the user’s network so that it is now controlled by us.
  2. Change the lightbulb’s color and intensity to be any annoying color of your choice. The user must think that the lightbulb has a glitch but that it is still working, so don’t shut it down.
  3. Optional: Update the firmware of the lightbulb (as was done in the original research) and perform the next steps from the lightbulb itself. For simplicity, we used our evaluation board instead, as we didn’t want to brick any lightbulb in the process, and had no motivation to create a fully weaponized autonomous attack.
  4. The user eventually sees that something is wrong with the lightbulb. It appears as “Unreachable” in the mobile app. The user then “resets” it.
  5. The only way to reset the lightbulb is to delete it from the app, and then tell the bridge to search for new lightbulbs. Bingo! Now we can start our attack.
  6. The stolen lightbulb is in a different ZigBee network so it won’t be discovered by the bridge.
  7. We masquerade as a legitimate lightbulb that the user can see in the app, and reconfigure the lightbulb to use the original color.
  8. Behind the scenes, we create additional phantom lightbulbs that exploit the vulnerability in the bridge and install our backdoor.
  9. The “legitimate” lightbulb uses this backdoor to install malware on the targeted bridge.
  10. Our malware connects back to us through the internet, and we have now successfully infiltrated the target’s IP network from the ZigBee radio network.

For our demonstration, we chose to use the leaked NSA EternalBlue exploit, just as we did in our FAX research. The exploit is executed from the bridge itself, and is used to attack unpatched computers inside the target’s IP network.

Product Protection Notes

In the second part of the YouTube video, you can see an exploitation attempt on the same vulnerable Hue Bridge, this time with our IoT nano-agent installed on it. This nano-agent enforces Control-Flow-Integrity (CFI) and adds on-device protection to the firmware itself, thus successfully identifying and blocking our attack, even without familiarity with the exact 0-Day that we’ve exploited.

Check Point provides a consolidated security solution that hardens and protects the firmware of IoT devices. Using a recently acquired technology, Check Point allows organizations to mitigate device-level attacks before devices are compromised, by means of on-device runtime protection. In addition to device-level security, Check Point offers network-level IoT protection by monitoring IoT traffic, identifying malicious communications or access attempts, and blocking them.

Special Thanks

This research was done with the help of the Check Point Institute for Information Security (CPIIS) at Tel Aviv University. And on a more personal note, with the help of an old colleague: Eyal Ronen (@eyalr0).

Coordinated Disclosure

  • 5 November 2019 – Vulnerabilities were disclosed to Philips, and forwarded to Signify.
  • 5 November 2019 – Signify acknowledged our report, and confirmed the existence of the vulnerability in their product.
  • 25 November 2019 – Signify notified us that they finished developing and testing the patch, and that it would be released in January 2020.
  • 13 January 2020 – The patch was deployed as a remote firmware update (1935144040).
  • 28 January 2020 – Due to geographical-based distribution, it took some time for our bridge to automatically install the firmware update.
  • 5 February 2020 – We released a demo video to raise awareness, and withheld the technical details until users had enough time to update their products.
  • 7 August 2020 – Full public disclosure during DEF CON 28.