FIO02-J. Keep track of bytes read and account for character encoding while reading data

According to the Java API [API 2006] for the class InputStream, the read(b) method, which takes an array of bytes b as its parameter:

Reads some number of bytes from the input stream and stores them into the buffer array b. The number of bytes actually read is returned as an integer. The number of bytes read is, at most, equal to the length of b.

Note that the read() methods return as soon as they find available input data. Ignoring the result returned by the read() methods is a violation of guideline EXP00-J. Do not ignore values returned by methods. Even when return values are not ignored, security issues can arise because, by default, none of the methods guarantees that all the requested bytes will be read. The programmer must check the number of bytes read and call the read() method again as required.
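The following is a minimal sketch of the behavior described above, not code from the guideline. The class ShortReadDemo and its methods trickle() and readUpTo() are hypothetical names introduced for illustration. The trickle() stream is assumed to return at most three bytes per call, imitating a source that delivers data in small pieces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of the guideline or the Java API.
class ShortReadDemo {

  // Hypothetical stream that hands out at most 3 bytes per read() call,
  // even when the caller asks for more.
  static InputStream trickle(byte[] source) {
    return new ByteArrayInputStream(source) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 3));
      }
    };
  }

  // Calls read() repeatedly until the buffer is full or the stream ends.
  // Returns the number of bytes actually stored in buf.
  static int readUpTo(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int n = in.read(buf, total, buf.length - total);
      if (n == -1) {
        break; // end of stream reached before the buffer was filled
      }
      total += n;
    }
    return total;
  }
}
```

With a 10-byte source, a single read(buf, 0, 10) call on the trickle() stream stores only three bytes, whereas readUpTo() keeps calling read() until all ten have arrived.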

Failure to handle multibyte encoded data is another source of data read errors. Multibyte encodings such as UTF-8 are used for character sets that require more than one byte to uniquely identify each constituent character. For example, the Japanese encoding Shift-JIS (shown below) is a multibyte encoding in which the maximum character length is two bytes (one leading and one trailing byte).

Byte Type	Range
single-byte	0x00 through 0x7F and 0xA0 through 0xDF
lead-byte	0x81 through 0x9F and 0xE0 through 0xFC
trailing-byte	0x40 through 0x7E and 0x80 through 0xFC

The trailing-byte ranges overlap the ranges of both the single-byte and lead-byte characters. This can cause problems: if a multibyte character is split across a buffer boundary, each of its bytes will be interpreted on its own and may be read as a different character [Phillips 2005].
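The overlap can be made concrete with a small sketch that encodes the ranges from the table. The class ShiftJisBytes and its method names are hypothetical and exist only for illustration. Each method expects an unsigned byte value in the range 0x00 through 0xFF.

```java
// Hypothetical helper that encodes the byte ranges listed in the table.
// Arguments are unsigned byte values (0x00 through 0xFF).
class ShiftJisBytes {

  static boolean isSingleByte(int b) {
    return (b >= 0x00 && b <= 0x7F) || (b >= 0xA0 && b <= 0xDF);
  }

  static boolean isLeadByte(int b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
  }

  static boolean isTrailingByte(int b) {
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
  }
}
```

Under these ranges, the value 0x41 qualifies as both a single-byte character and a trailing byte, and 0x83 qualifies as both a lead byte and a trailing byte. A two-byte sequence such as 0x83 0x41 is therefore ambiguous once it is split: a decoder that sees only 0x41 at the start of the next buffer has no way to tell that it was the second half of a character.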

A third data reading issue arises from the behavior of the String class constructor with respect to the default encoding. See guideline FIO03-J. Specify the character encoding while performing file or network IO for more details.

Noncompliant Code Example

This noncompliant code example attempts to read a specific number of bytes from a FileInputStream but suffers from a few pitfalls. The objective is to read 1024 bytes and return them as a String. Unfortunately, this may not happen because of the general contract of the read() methods: a call may return fewer bytes than requested, yet the code ignores the count and converts the entire buffer to a String on every iteration.

public static String readBytes(FileInputStream in) throws IOException {
  String str = "";
  byte[] data = new byte[1024];
  while (in.read(data) > -1) {
    str += new String(data);
  }
  return str;
}

A second issue involves multibyte character encoding. A call to read() may fill the buffer so that it ends with the leading byte of a multibyte character, leaving the trailing bytes to be read in the next iteration. Because each buffer is converted to a String separately before being concatenated to str, the multibyte encoding information does not extend across buffer boundaries and the character is decoded incorrectly.

Finally, the bytes stored in str are decoded using the default encoding of the system because no specific encoding is specified in the call to the String class constructor.
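The boundary problem can be reproduced without any file I/O. The sketch below is illustrative only. The class SplitDecodeDemo and its methods are hypothetical names, and UTF-8 is used instead of Shift-JIS only because every Java runtime is guaranteed to support it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the guideline.
class SplitDecodeDemo {

  // Decodes each fixed-size chunk on its own and concatenates the
  // results, which mirrors what the noncompliant loop does per buffer.
  static String decodePerChunk(byte[] bytes, int chunkSize) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i += chunkSize) {
      int end = Math.min(i + chunkSize, bytes.length);
      sb.append(new String(bytes, i, end - i, StandardCharsets.UTF_8));
    }
    return sb.toString();
  }

  // Decodes all bytes in a single step, so no character is split.
  static String decodeWhole(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

For the string "a\u00e9", whose UTF-8 encoding is the three bytes 0x61 0xC3 0xA9, a chunk size of 2 separates 0xC3 from 0xA9. Each half is malformed when decoded alone, so the result contains the replacement character U+FFFD instead of the original text, whereas decodeWhole() returns the original string.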

Compliant Solution (1)

This compliant solution accounts for the total number of bytes read (and adjusts the remaining bytes' offset) so that the required data is fully read. It also specifies the String encoding explicitly to facilitate portability across systems that use different default encodings.

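public static String readBytes(FileInputStream in) throws IOException {
  int offset = 0;
  int bytesRead = 0;
  byte[] data = new byte[1024];
  // read() may return fewer bytes than requested, so keep reading until
  // the buffer is full or the end of the stream is reached
  while (offset < data.length
      && (bytesRead = in.read(data, offset, data.length - offset)) != -1) {
    offset += bytesRead;
  }
  // Decode only the bytes actually read, using an explicit encoding
  String str = new String(data, 0, offset, "UTF-8");
  return str;
}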

The size of the data byte buffer depends on the maximum number of bytes required to encode a character. For example, UTF-8 requires at most three bytes for any character up to U+FFFF and four bytes for any character above U+FFFF. Although each such four-byte sequence denotes a single character, Java represents it as two separate char values (a surrogate pair) of two bytes each because Java internally uses UTF-16. Consequently, a buffer intended to hold a given number of UTF-8 encoded characters should allow up to four bytes per character.
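The byte counts can be checked directly. The following sketch is illustrative. Utf8Lengths and utf8Length() are hypothetical names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the guideline.
class Utf8Lengths {

  // Returns the number of bytes needed to encode s in UTF-8.
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }
}
```

For example, utf8Length("A") is 1, utf8Length("\u00e9") is 2, and utf8Length("\u20ac") is 3. A supplementary character such as U+1F600, built with new String(Character.toChars(0x1F600)), encodes to 4 bytes in UTF-8 while its length() is 2, because it occupies two char values.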

Compliant Solution (2)

The one-argument and three-argument readFully() methods of the DataInputStream class can be used to read all the requested data. An EOFException is thrown if the end of input is detected before the required number of bytes has been read, and an IOException is thrown if some other input/output error occurs. The exception handler decides the way forward.

public static String readBytes(FileInputStream fis) throws IOException {
  byte[] data = new byte[1024];
  DataInputStream dis = new DataInputStream(fis);
  dis.readFully(data);
  String str = new String(data, "UTF-8");
  return str;
}
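How a caller might react to a short input can be sketched as follows. This is an illustration under stated assumptions, not the guideline's own code. ReadFullyDemo and canReadFully() are hypothetical names, and returning false is only one possible way for the handler to proceed.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class, not part of the guideline.
class ReadFullyDemo {

  // Returns true if count bytes could be read from in, and false if the
  // stream ended first. Other I/O errors propagate as IOException.
  static boolean canReadFully(InputStream in, int count) throws IOException {
    DataInputStream dis = new DataInputStream(in);
    try {
      dis.readFully(new byte[count]);
      return true;
    } catch (EOFException e) {
      // Fewer than count bytes were available before the end of input
      return false;
    }
  }
}
```

Given a ByteArrayInputStream over 8 bytes, canReadFully(in, 8) returns true, while a request for 16 bytes from the same amount of data returns false because readFully() throws an EOFException.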

Risk Assessment

Failure to comply with this guideline can result in the wrong number of bytes being read or character sequences being interpreted incorrectly.

Guideline	Severity	Likelihood	Remediation Cost	Priority	Level
FIO02-J	low	unlikely	medium	P2	L3

Automated Detection

TODO

Related Vulnerabilities

Search for vulnerabilities resulting from the violation of this rule on the CERT website.

References

[API 2006] Class InputStream, DataInputStream
[Phillips 2005]
[Harold 1999] Chapter 7: Data Streams, Reading Byte Arrays
[Chess 2007] 8.1 Handling Errors with Return Codes
[MITRE 2009] CWE ID 135 "Incorrect Calculation of Multi-Byte String Length"
