String objects in Java are encoded in UTF-16. The Java platform is required to support other character encodings, or charsets, such as US-ASCII, ISO-8859-1, and UTF-8. Errors may occur when converting between differently encoded character data. There are two general types of encoding errors. If the byte sequence is not valid for the specified charset, the input is considered malformed. If the byte sequence cannot be mapped to an equivalent character sequence, an unmappable character has been encountered.
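The two error types can be observed with coders that report errors instead of replacing them. The following is a minimal sketch rather than part of the guideline itself: the class EncodingErrorKinds and its methods are hypothetical names, and the unmappable case is shown on the encoding side (character to bytes) because it is easy to reproduce with a standard charset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

class EncodingErrorKinds {
  // Decodes strictly and names the kind of error, if any (hypothetical helper).
  static String classifyDecode(byte[] bytes, Charset charset) {
    try {
      charset.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return "ok";
    } catch (MalformedInputException e) {
      return "malformed";
    } catch (UnmappableCharacterException e) {
      return "unmappable";
    } catch (CharacterCodingException e) {
      return "other";
    }
  }

  // Encodes strictly and names the kind of error, if any (hypothetical helper).
  static String classifyEncode(String text, Charset charset) {
    try {
      charset.newEncoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .encode(CharBuffer.wrap(text));
      return "ok";
    } catch (MalformedInputException e) {
      return "malformed";
    } catch (UnmappableCharacterException e) {
      return "unmappable";
    } catch (CharacterCodingException e) {
      return "other";
    }
  }

  public static void main(String[] args) {
    // 0xC3 opens a two-byte UTF-8 sequence, but 0x28 is not a continuation byte
    byte[] invalidUtf8 = {(byte) 0xC3, (byte) 0x28};
    System.out.println(classifyDecode(invalidUtf8, StandardCharsets.UTF_8));   // malformed
    // U+20AC (euro sign) has no representation in ISO-8859-1
    System.out.println(classifyEncode("\u20AC", StandardCharsets.ISO_8859_1)); // unmappable
  }
}
```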

According to the Java API [API 2014] for the String constructors:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified.

Similarly, the description of the String.getBytes(Charset) method states:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
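The silent replacement described in that quotation can be illustrated as follows. This is a small sketch under stated assumptions: the class GetBytesReplacement and its methods are hypothetical, and US-ASCII is chosen only because its replacement byte is the familiar question mark.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GetBytesReplacement {
  // Encodes with String.getBytes(Charset), which substitutes the charset's
  // replacement bytes for any character it cannot encode.
  static byte[] lossyAscii(String text) {
    return text.getBytes(StandardCharsets.US_ASCII);
  }

  // Checks up front whether every character can be encoded in US-ASCII.
  static boolean isAsciiEncodable(String text) {
    return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
  }

  public static void main(String[] args) {
    String text = "caf\u00E9"; // U+00E9 is outside the US-ASCII range
    byte[] encoded = lossyAscii(text);
    System.out.println(Arrays.toString(encoded));                        // [99, 97, 102, 63]
    System.out.println(new String(encoded, StandardCharsets.US_ASCII));  // caf?
    System.out.println(isAsciiEncodable(text));                          // false
  }
}
```

No exception is raised by getBytes(), so a caller that needs to detect the loss has to check beforehand, as canEncode() does here, or use a CharsetEncoder directly.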

The CharsetEncoder class is used to transform character data into a sequence of bytes in a specific charset.   The input character sequence is provided in a character buffer or a series of such buffers. The output byte sequence is written to a byte buffer or a series of such buffers.  The CharsetDecoder class reverses this process by transforming a sequence of bytes in a specific charset into character data.  The input byte sequence is provided in a byte buffer or a series of such buffers, while the output character sequence is written to a character buffer or a series of such buffers.
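The buffer-based flow described above might look like the following sketch. The class BufferRoundTrip and its method names are hypothetical; it uses the single-call encode(CharBuffer) and decode(ByteBuffer) convenience methods rather than the incremental multi-buffer API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class BufferRoundTrip {
  // Characters in, bytes out: the encoder reads a CharBuffer and returns a ByteBuffer.
  static ByteBuffer encodeUtf8(String text) throws CharacterCodingException {
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    return encoder.encode(CharBuffer.wrap(text));
  }

  // Bytes in, characters out: the decoder reads a ByteBuffer and returns a CharBuffer.
  static String decodeUtf8(ByteBuffer bytes) throws CharacterCodingException {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    return decoder.decode(bytes).toString();
  }

  public static void main(String[] args) throws CharacterCodingException {
    ByteBuffer bytes = encodeUtf8("h\u00E9llo");
    System.out.println(bytes.remaining());                       // 6 (U+00E9 takes two bytes)
    System.out.println(decodeUtf8(bytes).equals("h\u00E9llo"));  // true
  }
}
```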

Special care should be taken when decoding untrusted byte data to ensure that malformed-input or unmappable-character errors do not result in defects and vulnerabilities. Encoding errors can also occur; for example, encoding a cryptographic key that contains malformed input for transmission will result in an error. Encoding and decoding errors typically result in data corruption.
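The data corruption mentioned above can be made visible by passing non-text bytes through a String and back. This is an illustrative sketch only: the class LossyRoundTrip and the method throughString are hypothetical, and UTF-8 is an arbitrary choice of charset for the demonstration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyRoundTrip {
  // Decodes arbitrary bytes as UTF-8 text, then encodes the text again.
  // Invalid sequences are replaced during decoding, so the result can differ.
  static byte[] throughString(byte[] original) {
    String text = new String(original, StandardCharsets.UTF_8);
    return text.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Two's-complement bytes of a number, which are not valid UTF-8 text
    byte[] original = new BigInteger("530500452766").toByteArray();
    byte[] roundTripped = throughString(original);
    System.out.println(Arrays.equals(original, roundTripped)); // false
    // The recovered number no longer matches the original value
    System.out.println(new BigInteger(roundTripped));
  }
}
```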

Noncompliant Code Example

This noncompliant code example is similar to the one used in STR03-J, "Do not represent numeric data as strings," in that it attempts to convert a byte array containing the two's-complement representation of a BigInteger value to a String. Because the byte array contains malformed-input sequences, the behavior of the String constructor is unspecified.

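import java.math.BigInteger;

public class CharsetConversion {
  public static void main(String[] args) {
    BigInteger x = new BigInteger("530500452766");
    // Two's-complement bytes of x; these are not character data
    byte[] byteArray = x.toByteArray();
    // Noncompliant: decodes with the platform default charset, and the
    // result for invalid byte sequences is unspecified
    String s = new String(byteArray);
    System.out.println(s);
  }
}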

Compliant Solution

The java.nio.charset.CharsetEncoder and java.nio.charset.CharsetDecoder classes provide greater control over the process. In this compliant solution, the CharsetDecoder.decode() method is used to convert the byte array containing the two's-complement representation of the BigInteger value to a CharBuffer. Because the bytes do not represent valid UTF-16, the input is considered malformed, and a MalformedInputException is thrown.

import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

public class CharsetConversion {
  public static void main(String[] args) {
    CharsetDecoder decoder = StandardCharsets.UTF_16.newDecoder();
    BigInteger x = new BigInteger("530500452766");
    byte[] byteArray = x.toByteArray();
    ByteBuffer byteBuffer = ByteBuffer.wrap(byteArray);
    try {
      // decode() throws on malformed input instead of silently replacing it
      CharBuffer charBuffer = decoder.decode(byteBuffer);
      String s = charBuffer.toString();
      System.out.println(s);
    } catch (IllegalStateException e) {
      e.printStackTrace();
    } catch (MalformedInputException e) {
      e.printStackTrace();
    } catch (UnmappableCharacterException e) {
      e.printStackTrace();
    } catch (CharacterCodingException e) {
      e.printStackTrace();
    }
  }
}
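The same approach can be packaged as a reusable helper. The sketch below is one possible shape, not the guideline's prescribed API: the class StrictDecoding and the method decodeStrict are hypothetical names, and the error actions are set to REPORT explicitly so the intent does not depend on a decoder's defaults.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecoding {
  // Decodes the bytes in the given charset, throwing rather than replacing
  // when the input is malformed or contains unmappable sequences.
  static String decodeStrict(byte[] bytes, Charset charset)
      throws CharacterCodingException {
    CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    return decoder.decode(ByteBuffer.wrap(bytes)).toString();
  }

  public static void main(String[] args) throws CharacterCodingException {
    // Genuine UTF-16 text decodes normally
    byte[] valid = "hi".getBytes(StandardCharsets.UTF_16);
    System.out.println(decodeStrict(valid, StandardCharsets.UTF_16)); // hi

    // The BigInteger bytes from the example above are rejected
    byte[] binary = new BigInteger("530500452766").toByteArray();
    try {
      decodeStrict(binary, StandardCharsets.UTF_16);
    } catch (CharacterCodingException e) {
      System.out.println("rejected: " + e);
    }
  }
}
```

Callers then decide how to handle the CharacterCodingException, for example by rejecting the untrusted input outright.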

Risk Assessment

Malformed input or unmappable character errors can result in a loss of data integrity.

Rule      Severity   Likelihood   Remediation Cost   Priority   Level
STR05-J   low        unlikely     medium             P2         L3

Related Guidelines

MITRE CWE

CWE-838. Inappropriate Encoding for Output Context

CWE-116. Improper Encoding or Escaping of Output

Bibliography

 

11 Comments

  1. The compliant solution implicitly uses the platform default character encoding twice. Generally it is better to specify an encoding explicitly (even if you do want the platform default). This also goes for the likes of locale, timezone, etc. In this particular example, the platform default character encoding may not contain Latin characters, and their byte representation is arbitrary even if it does.

    1. Oops, yes, that's exactly what FIO03-J. "Specify the character encoding while performing file or network IO" says!

      Thanks.

  2. Instead of using "Some Arbitrary String" you can use the same BigInteger as the NCE and use the toString() method to convert it to a String and then do the bytearray stuff. The CS can then actually address the problem described in the NCE, rather than doing something which is not quite related to the NCE.

    EDIT: I've incorporated this change.

  3. In NCCE, the conversion from byte array to String (new String(byteArray)) depends on the default charset.

    for example, NCCE works ok on my PC(windows7 64bit, jdk-7-fcs-b147-x64), the default charset is windows-31j.

    how about changing new String(byteArray) to new String(byteArray, "US-ASCII") ?

    1. Well, that would make the noncompliant solution less noncompliant wouldn't it? (smile)
      Actually, I got different results when I ran the program, so I added those results in, along with the fact that the default encoding affects the results.

      1. using String(byteArray,"US-ASCII"), I got 529342807871 as a converted back BigInteger, which is clearly environment-dependent (-:

        Java SE API 6 (and 7) says

        The behavior of this constructor when the given bytes are not valid in the default charset is unspecified.

        in the following text, the value of s looks garbage on my browser.

        When run on a platform where the default character encoding is US-ASCII, the string s gets the value {{{J}}, because some of the characters are unprintable. When converted back to a BigInteger, x gets the value 149830058370101340468658109.

        so, how about replacing that paragraph with the following text?

        When run on a platform where the default character encoding is US-ASCII, the string s includes some unprintable characters. When converted back to a BigInteger, x gets the value 149830058370101340468658109.

  4. I've hit this problem twice in 10 years, I was expecting it to be higher than 'unlikely'.

    One instance was storing the binary values of encrypted passwords as Strings in the DB, and then wondering why people were complaining about being locked out when we shifted DB OS.

    1. Just to clarify, the "unlikely" means how likely it is that a flaw introduced by violating the rule could lead to an exploitable vulnerability (see Priority and Levels). It is not meant to indicate how common the defect is. This information is used to decide how to prioritize repairs to the code.

      Let me know if you still believe this should be changed.  Your example above seemed exceptionally secure. 8^)

      1. OK, agreed on the probability.

        Unfortunately it wasn't an 'example' it was a real life project for a government agency. 

  5. I'm very confused by the first example.  As far as I can tell, this example has nothing to do with specifying a valid character encoding.  The problem is that in the NCE, the BigInteger is converted to binary and in the CS the BigInteger is converted to a String.

    Perhaps there are two separate rules here?  The rule that goes with this example is probably "Don't represent binary values as Strings".  I would think this would be pretty obvious, but A Bishop has seen it twice in 10 years.  I do sort of like his example, if we could code it up.

     

  6. AFAICT, the NCE and CS perform the same actions, as the default behavior of CharsetEncoder seems to be the same as or similar to getBytes().  Right now I'm thinking this would be best as a guideline which says something like "Use the CharsetEncoder class when more control over the encoding process is required."