FIO02-J. Keep track of bytes read and account for character encoding while reading data

According to the Java API [API 2006] for the class InputStream, with an array of bytes b as the parameter, the read(b) method:

Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. The number of bytes read is, at most, equal to the length of b.

Note that the read() methods return as soon as they find available input data. Ignoring the result returned by the read() methods is a violation of guideline EXP00-J. Do not ignore values returned by methods. Security issues can arise even when return values are considered, because the default behavior of the read() methods lacks any guarantee that the entire buffer array will be filled. The programmer must check the number of bytes actually read and call the read() method again as required.
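The contract described above can be observed without a real file or socket. The following is a minimal sketch, not part of the guideline itself: ShortReadDemo, trickle(), readOnce(), and readUntilFull() are hypothetical names introduced for illustration, and the three-byte limit is an arbitrary assumption that mimics a source delivering data in small pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class ShortReadDemo {
  // Hypothetical stream that hands out at most 3 bytes per call,
  // mimicking a source that delivers data in small pieces.
  static InputStream trickle(byte[] source) {
    return new ByteArrayInputStream(source) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 3));
      }
    };
  }

  // A single call to read(b): may return fewer bytes than b.length.
  static int readOnce(InputStream in, byte[] buf) throws IOException {
    return in.read(buf);
  }

  // Calls read() repeatedly until the buffer is full or the stream ends;
  // returns the total number of bytes stored in buf.
  static int readUntilFull(InputStream in, byte[] buf) throws IOException {
    int offset = 0;
    int n;
    while (offset < buf.length
        && (n = in.read(buf, offset, buf.length - offset)) != -1) {
      offset += n;
    }
    return offset;
  }

  public static void main(String[] args) throws IOException {
    byte[] source = new byte[10];
    System.out.println(readOnce(trickle(source), new byte[10]));      // prints 3
    System.out.println(readUntilFull(trickle(source), new byte[10])); // prints 10
  }
}
```

Ten bytes are available in both cases, but the single call reports only three of them; only the loop that tracks the running total fills the buffer.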

Another source of data read errors is failure to correctly handle multibyte encoded data. Multibyte encodings such as UTF-8 are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) supports multibyte encoding in which the maximum character length is two bytes (one lead byte and one trailing byte).

Byte Type	Range
single-byte	0x00 through 0x7F and 0xA0 through 0xDF
lead-byte	0x81 through 0x9F and 0xE0 through 0xFC
trailing-byte	0x40 through 0x7E and 0x80 through 0xFC

The trailing byte ranges overlap the ranges of both the single-byte and lead-byte characters. When a multibyte character is separated across a buffer boundary, it can be interpreted differently than if it were not separated across the buffer boundary; this difference arises from the ambiguity of its constituent bytes [Phillips 2005].

A third data reading issue arises from the behavior of the String class constructor with respect to the default encoding. See guideline FIO03-J. Specify the character encoding while performing file or network IO for more details.

Noncompliant Code Example

This noncompliant code example attempts to read 1024 bytes from a FileInputStream and to return them as a String.

public static String readBytes(FileInputStream in) throws IOException {
  String str = "";
  byte[] data = new byte[1024];
  while (in.read(data) > -1) {
    str += new String(data);
  }
  return str;
}

This noncompliant code example can fail in several different ways. First, the programmer's misunderstanding of the general contract of the read() methods can result in failure to read the intended data in full. Because the return value is used only to detect the end of the stream, each iteration also converts the entire 1024-byte buffer, even when read() stored fewer bytes, so the result can contain stale or unset bytes. Second, the code fails to consider the interaction between characters represented with a multibyte encoding and the boundaries between the loop iterations. When the last byte read from the data stream is the leading byte of a multibyte character, the trailing bytes will not be encountered until the next iteration of the while loop. However, multibyte encoding is resolved during construction of the new String within the loop. Consequently, the multibyte encoding will be interpreted incorrectly in this case. Finally, because no character encoding is specified in the call to the String class constructor, the constructor uses the system default character encoding to interpret the bytes in the buffer. If the input used a character encoding that differs from the system's default character encoding, the resulting string can be corrupted.
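The boundary problem can be reproduced in isolation. The sketch below is illustrative only: SplitDecodeDemo, decodeInChunks(), and decodeWhole() are hypothetical names, and it uses UTF-8 rather than Shift-JIS on the assumption that a charset guaranteed on every Java platform makes the example easier to run. The chunk size of two bytes is chosen so that the two-byte character is cut in half.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitDecodeDemo {
  // Decodes each fixed-size chunk separately, which is effectively what
  // constructing a String inside the read loop does.
  static String decodeInChunks(byte[] bytes, int chunkSize) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i += chunkSize) {
      int end = Math.min(i + chunkSize, bytes.length);
      byte[] chunk = Arrays.copyOfRange(bytes, i, end);
      sb.append(new String(chunk, StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  // Decodes all bytes in one step, after reading has finished.
  static String decodeWhole(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // "a" followed by U+00E9 encodes to the three bytes 0x61 0xC3 0xA9.
    String original = "a\u00E9";
    byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

    System.out.println(decodeWhole(bytes).equals(original));       // prints true
    System.out.println(decodeInChunks(bytes, 2).equals(original)); // prints false
  }
}
```

With a chunk size of two, the lead byte 0xC3 ends the first chunk and the trailing byte 0xA9 starts the second, so neither chunk contains a complete character and the decoder substitutes replacement characters.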

Compliant Solution (Multiple calls to read)

This compliant solution reads all the desired bytes into its buffer, accounting for the total number of bytes read and adjusting the remaining bytes' offset, thus ensuring that the required data are read in full. It avoids splitting multibyte encoded characters across buffers by deferring construction of the result string until the data have been read in full. It also facilitates portability across systems that use different default character encodings by specifying an explicit character encoding for the String constructor.

public static String readBytes(FileInputStream in) throws IOException {
  int offset = 0;
  int bytesRead = 0;
  byte[] data = new byte[1024];
  // Keep reading until the buffer is full or the end of the stream is reached
  while (offset < data.length) {
    bytesRead = in.read(data, offset, data.length - offset);
    if (bytesRead == -1) {
      break;
    }
    offset += bytesRead;
  }
  // Decode only the bytes actually read, using an explicit encoding
  String str = new String(data, 0, offset, "UTF-8");
  return str;
}

The size of the data byte buffer depends on the maximum number of bytes required to write an encoded character. For example, UTF-8 encoded data requires four bytes to represent any character above U+FFFF. Because Java uses the UTF-16 character encoding to represent char data, such characters are split into two separate char values of two bytes each. Consequently, the buffer should allow up to four bytes for each character expected in the input.
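The byte and char counts mentioned above can be checked directly. This is a small sketch with hypothetical names (EncodedLengthDemo, utf8Length(), charCount()); the code point U+1F600 is an arbitrary choice of a character above U+FFFF.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
  // Number of bytes needed to encode the given code point in UTF-8.
  static int utf8Length(int codePoint) {
    String s = new String(Character.toChars(codePoint));
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Number of Java char values (UTF-16 code units) used for the code point.
  static int charCount(int codePoint) {
    return new String(Character.toChars(codePoint)).length();
  }

  public static void main(String[] args) {
    System.out.println(utf8Length('A'));      // prints 1
    System.out.println(utf8Length(0x1F600));  // prints 4
    System.out.println(charCount(0x1F600));   // prints 2
  }
}
```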

Compliant Solution (readFully)

The one-argument and three-argument readFully() methods of the DataInputStream class guarantee that they either will read all of the requested data or will throw an exception. These methods throw EOFException if they detect the end of input before the required number of bytes have been read; they throw IOException if some other input/output error occurs. This compliant solution also specifies an explicit character encoding to the String constructor.

public static String readBytes(FileInputStream fis) throws IOException {
  byte[] data = new byte[1024];
  DataInputStream dis = new DataInputStream(fis);
  dis.readFully(data);
  String str = new String(data, "UTF-8");
  return str;
}
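The all-or-nothing behavior of readFully() can be seen with an in-memory stream. The following sketch uses hypothetical names (ReadFullyDemo, fillsBuffer()) and a ByteArrayInputStream as a stand-in for a file; it only reports whether the requested number of bytes could be read.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

class ReadFullyDemo {
  // Returns true if readFully() could fill a buffer of bufferSize bytes
  // from the source, or false if the input ended first.
  static boolean fillsBuffer(byte[] source, int bufferSize) throws IOException {
    DataInputStream dis = new DataInputStream(new ByteArrayInputStream(source));
    try {
      dis.readFully(new byte[bufferSize]);
      return true;
    } catch (EOFException e) {
      // End of input reached before bufferSize bytes were read
      return false;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(fillsBuffer(new byte[1024], 1024)); // prints true
    System.out.println(fillsBuffer(new byte[100], 1024));  // prints false
  }
}
```

A caller of the readFully() solution therefore needs to be prepared for EOFException whenever the input may be shorter than the buffer.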

Risk Assessment

Failure to comply with this guideline can result in the wrong number of bytes being read or character sequences being interpreted incorrectly.

Guideline	Severity	Likelihood	Remediation Cost	Priority	Level
FIO02-J	low	unlikely	medium	P2	L3

Automated Detection

TODO

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this guideline on the CERT website.

Bibliography

[API 2006] Class InputStream, DataInputStream
[Chess 2007] 8.1 Handling Errors with Return Codes
[Harold 1999] Chapter 7: Data Streams, Reading Byte Arrays
[MITRE 2009] CWE ID 135 "Incorrect Calculation of Multi-Byte String Length"
[Phillips 2005]
